Machine Learning Theory and Applications
Xavier Vasques
IBM Technology, Bois-Colombes, France
Laboratoire de Recherche en Neurosciences Cliniques, Montferrier-sur-Lez, France
Ecole Nationale Supérieure de Cognitique Bordeaux, Bordeaux, France
Copyright © 2024 by John Wiley & Sons, Inc. All rights reserved.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical,
photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act,
without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright
Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests
to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030,
(201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permission.
Trademarks: Wiley and the Wiley logo are trademarks or registered trademarks of John Wiley & Sons, Inc. and/or its affiliates in the United States
and other countries and may not be used without written permission. All other trademarks are the property of their respective owners. John Wiley &
Sons, Inc. is not associated with any product or vendor mentioned in this book.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no
representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied
warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales
materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where
appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was
written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but
not limited to special, incidental, consequential, or other damages.
For general information on our other products and services or for technical support, please contact our Customer Care Department within the
United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For
more information about Wiley products, visit our web site at www.wiley.com.
Contents
Foreword xiii
Acknowledgments xv
General Introduction xvii
1 Concepts, Libraries, and Essential Tools in Machine Learning and Deep Learning 1
1.1 Learning Styles for Machine Learning 2
1.1.1 Supervised Learning 2
1.1.1.1 Overfitting and Underfitting 3
1.1.1.2 K-Folds Cross-Validation 4
1.1.1.3 Train/Test Split 4
1.1.1.4 Confusion Matrix 5
1.1.1.5 Loss Functions 7
1.1.2 Unsupervised Learning 9
1.1.3 Semi-Supervised Learning 9
1.1.4 Reinforcement Learning 9
1.2 Essential Python Tools for Machine Learning 9
1.2.1 Data Manipulation with Python 10
1.2.2 Python Machine Learning Libraries 10
1.2.2.1 Scikit-learn 10
1.2.2.2 TensorFlow 10
1.2.2.3 Keras 12
1.2.2.4 PyTorch 12
1.2.3 Jupyter Notebook and JupyterLab 13
1.3 HephAIstos for Running Machine Learning on CPUs, GPUs, and QPUs 13
1.3.1 Installation 13
1.3.2 HephAIstos Function 15
1.4 Where to Find the Datasets and Code Examples 32
Further Reading 33
Foreword
The wheels of time turn faster and faster, and as individuals and as human society we all need to adapt and follow. Progress
is uncountable in all domains.
Over the last two years, the author has dedicated many hours over weekends and late evenings to provide a volume of
reference to serve as a guide for all those who plan to travel through machine learning from scratch, and to use it in ela-
borated domains where it could make a real difference, for the good of the people, society, and our planet.
The story of this book started with a series of blog posts that attracted many readers, visits, and interactions. This initiative was not a surprise for me, knowing the author's background and following his developments in the fields of science and technology. Almost 20 years have passed since I first met Xavier Vasques. He had just been appointed for a PhD in
applied mathematics, but he was still searching for a salient and tangible application of mathematics. Therefore, he
switched to a domain where mathematics was applied to neurosciences and medicine. The topic related to deep brain
stimulation (DBS), a neurosurgical intervention using electric stimulation to modulate dysfunctional brain networks to
alleviate disabling symptoms in neurological and psychiatric conditions. Therapeutic electric field modelling was the
topic of his PhD thesis in Neurosciences. He further completed his training by a master’s in computer science from
The Conservatoire National des Arts et Métiers, and continued his career in IBM where all his skills combined to push
technological development further and fill the gap in many domains through fruitful projects. Neurosciences remained
one of his main interests. He joined the Ecole Polytechnique Fédérale de Lausanne in Switzerland as researcher and
manager of both the Data Analysis and the Brain Atlasing sections for the Blue Brain Project and the Human Brain Proj-
ect in the Neuroinformatics division. Back at IBM, he is currently Vice-President and CTO of IBM Technology and
Research and Development in France.
Throughout his career, Xavier could contemplate the stringent need but also the lack of communication, understanding,
and exchanges between mathematics, computer science, and industry to support technological development. Computer scientists use routines and algorithms without always knowing precisely what lies behind them from the mathematical standpoint. Mathematicians do not always master code and coding. Industry, involved in the production of hardware, does not always master the concepts and knowledge from these two domains needed to drive progress.
The overall intention of this book is to provide a tool for both beginners and advanced users and facilitate translation from
theoretical mathematics to coding, from software to hardware, by understanding and mastering machine learning.
The very personal style, with handwritten elements, "hand-crafted" figures, and a "hands-on" approach, makes the book even more accessible and friendly.
May this book be an opportunity for many people, and a guidance for understanding and bringing forth, in a constructive
and meaningful way, data science solely for the good of mankind in this busy, febrile, and unsteady twenty-first century.
I am writing from the perspective of the clinician who has so many times wondered what a code is, what an algorithm is, what supervised and unsupervised machine learning and deep learning are, and so forth.
I already see vocations forming from very young ages, where future geeks (if the term is still accepted by today's youth) will be able to structure their skills and, why not, push this large body of work forward by challenging the author, criticizing, and completing the work.
It is always a joy to see somebody achieve. This is the case for this work by Xavier, who spared neither energy nor time to leave the signature not only of his curriculum but especially of his engagement with, and sharing for, today's society.
Acknowledgments
I would like to express my deepest gratitude to my loving wife and daughter, whose unwavering support and understanding
have been invaluable throughout the journey of writing this book. Their patience, encouragement, and belief in my abilities
have been a constant source of motivation.
To my wife, thank you for your endless love, understanding, and for standing by me during the countless hours spent
researching, writing, and editing. Your unwavering support and belief in my work have been a guiding light.
To my dear daughter, thank you for your patience, understanding, and for being a constant source of inspiration. Your
enthusiasm for learning and exploring new ideas has fueled my passion for this project.
I am truly grateful for the love, understanding, and encouragement that my wife and daughter have provided. Without
them, this book would not have been possible. Their presence in my life has made every step of this journey meaningful and
fulfilling.
Thank you, from the bottom of my heart.
General Introduction
Thomas Hobbes begins his Leviathan by saying, “Reason is nothing but reckoning.” This aphorism implies that we could
behave like machines. The film The Matrix, meanwhile, lets us imagine that we are controlled by an artificial creature in
silico. This machine projects into our brains an imaginary, fictional world that we believe to be real. We are therefore
deceived by calculations and an electrode piercing the back of our skull. The scenarios abound in our imagination. Fiction
suggests to us that one day, it will be easy to replicate our brains, like simple machines, and far from the complexity that we
currently imagine. Any mainstream conference on artificial intelligence (AI) routinely shows an image from The Termina-
tor or 2001: A Space Odyssey.
If “reason is nothing but reckoning,” we could find a mathematical equation that simulates our thinking, our conscious-
ness, and our unconsciousness. This thought is not new. Since the dawn of time, humans have constantly sought to repro-
duce nature. The question of thought, as such, is one of the great questions that humanity has asked itself. What makes
Odysseus able to get away with tricks, flair, imagination, and intuition? How do we reflect, reason, argue, demonstrate,
predict, invent, adapt, make analogies, induce, deduce, or understand? Is there a model that allows us to approach these
things? Throughout our history, we have claimed that a machine cannot calculate like humans or speak, debate, or mul-
titask like humans. Our desire for mechanization over millennia has shown us that machines, tools, and techniques can
accomplish these tasks that we had thought were purely human. Does this mean that machines have surpassed humans? For a wide range of tasks, we can only concede that they have. Are machines human? No!
Since our species emerged, we have continued to create tools intended to improve our daily lives, increase our comfort,
make our tasks less painful, protect us against predators, and discover our world and beyond. These same tools have also
turned against us, even though they had not been endowed with any intelligence. Beyond the use as a tool of AI, the quest for
the thinking machine can be viewed in a slightly different way. It can be seen as a desire to know who we are, or what we are.
It can also be considered as a desire to play God. Since ancient times, philosophers and scientists have been asking these
questions and trying to understand and imitate nature in the hope of giving meaning to our being and sometimes to gain
control. This imitation involves the creation of simple or complex models more or less approaching reality. So it is with the
history of AI. AI comes not only from the history of the evolution of human thought on the body and the nature of the mind
through philosophy, myths, or science but also from the technologies that have accompanied us throughout our history,
from the pebble used to break shells to the supercolliders used to investigate quantum mechanics. Some historians have
found ancient evidence of human interest in artificial creatures, particularly in ancient Egypt, millennia before the coming
of Jesus Christ (BCE). Articulated statues, which could be described as automatons, were used during religious ceremonies
to depict a tradesperson such as a kneading baker or to represent Anubis or Qebehsenouf as a dog’s head with movable jaws.
Even if they are only toys or animated statuettes using screws, levers, or pulleys, we can see a desire to artificially reproduce
humans in action. These objects are not capable of moving on their own, but imagination can help. Automatons may
become a symbol of our progress and emancipatory promises. These advances have also been an opportunity for humans
to question their own humanity. The use of AI has been the subject of many questions and sources of concern about the
future of the human species and its dehumanization.
In Greek mythology, robot servants made by the divine blacksmith Hephaestus lay the foundations for this quest for
artificial creation. Despite the fact that Hephaestus is a god with deformed, twisted, crippled feet, this master of fire is con-
sidered an exceptional craftsman who has created magnificent divine works. A peculiarity of the blacksmith, recounted in
the Iliad, is his ability to create and animate objects capable of moving on their own and imitating life. He is credited with
creating golden servants who assist him in his work and many other automatons with different functions, including guard
dogs to protect the palace of Alkinoos, horses for the chariot of the Cabires, or the giant Talos to guard the island of Crete.
Items crafted by Hephaestus are also clever, allowing the gates of Olympus to open on their own or the bellows of the forge to
work autonomously. The materials such as gold and bronze used to make these artificial creatures offer them immense
resistance and even immortality. These automatons are there to serve the gods and to perform tedious, repetitive, and daunt-
ing tasks to perfection by surpassing mortals. No one can escape the dog forged by Hephaestus, and no one can circum-
navigate Crete three times a day as Talos does. The human dream may have found its origins here. In the time of
Kronos, humans lived with the gods and led a life without suffering, without pain, and without work, because nature pro-
duced abundantly without effort. All you had to do was stretch out your arm to pick up the fruit. The young golden servants
“perfectly embody the wealth, the beauty, the strength, the vitality of this bygone golden age for humans” (J.W. Alexandre
Marcinkowski). This perfect world, without slavery, without thankless tasks, where humans do not experience fatigue and
can dedicate themselves to noble causes, was taken up by certain philosophers including Aristotle, who in a famous passage
from Politics sees in artificial creatures an advantage that is certain:
If every tool, when ordered, or even of its own accord, could do the work that benefits it… then there would be no need
either of apprentices for the master workers or of slaves for the lords.
Aristotle, Politics
We can see in this citation one of the first definitions of AI. Hephaestus does not imitate the living but rather creates it,
which is different from imitation. Blacksmith automatons have intelligence, voice, and strength. His creations do not equal
the gods, who are immortal and impossible to equal. This difference shows a hierarchy between the gods and those living
automatons who are their subordinates. The latter are also superior to humans when considering the perfection of the tasks
that are performed simply, without defects or deception. This superiority is not entirely accurate in the sense that some
humans have shown themselves to be more intelligent than automatons to achieve their ends. We can cite Medea’s over-
coming of Talos. Hephaestus is the only deity capable of creating these wondrous creatures. But these myths lay the founda-
tions of the relationship between humans and technology. Hephaestus is inspired by nature, living beings, and the world. He
makes models that do not threaten the mortal world. These creatures are even prehumans if we think of Pandora. In the
Hellenistic period, machines were created, thanks to scientists and engineers such as Philo of Byzantium or Heron of Alex-
andria. We have seen the appearance of automatic doors that open by themselves at the sound of a trumpet, an automatic
water dispenser, and a machine using the contraction of air or its rarefaction to operate a clock. Many automatons are also
described in the Pneumatika and Automaton-Making by Héron. These automatons amaze but are not considered to produce
things industrially and have no economic or societal impact; these machines make shows. At that time, there was likely no
doubt that we could perhaps imitate nature and provide the illusion but surely not match it, unlike Hephaestus who instead
competes with nature. The works of Hephaestus are perfect, immortal, and capable of “engendering offspring.” When his crea-
tures leave Olympus and join humans, they degenerate and die. Hephaestus, unlike humans, does not imitate the living but
instead manufactures it. Despite thousands of years of stories, myths, attempts, and discoveries, we are still far from Hephaestus.
Nevertheless, our understanding has evolved. We have known for a century that our brain runs on fuel, oxygen, and
glucose. It also works with electricity since neurons transmit what they have to transmit, thanks to electrical phenomena,
using what are called action potentials. Electricity is something we can model. In his time, Galileo said that “nature is a book
written in mathematical language.” So, can we seriously consider the creation of a human brain, thanks to mathematics? To
imagine programming or simulating thought, you must first understand it, take it apart, and break it down. To encode a
reasoning process, you must first be able to decode it. The analysis of this process, or the desire for analysis in any case, has
existed for a very long time.
The concepts of modern computing have their origins in a time when mathematics and logic were two unrelated subjects.
Logic was notably developed, thanks to two philosophers, Plato and Aristotle. We do not necessarily make the connection,
but without Plato, Aristotle, or Galileo, we might not have seen IBM, Microsoft, Amazon, or Google. Mathematics and logic
are the basis of computer science. When AI began to develop, it was assumed that the functioning of thought could be
mechanized. The study of the mechanization of thought or reasoning has a very long history, as Chinese, Indian, and Greek
philosophers had already developed formal deduction methods during the first millennium BCE. Aristotle developed the
formal analysis of what is called the syllogism: for example, all men are mortal; Socrates is a man; therefore, Socrates is mortal. This looks like an algorithm. Euclid, around 300 BCE, subsequently wrote the Elements, which develops a formal model of reasoning. Al-Khwārizmi (born around 780 CE) developed algebra and gave the algorithm its name. Moving for-
ward several centuries, in the seventeenth century the philosophers Leibniz, Hobbes, and Descartes explored the possibility
that all rational thought could be systematically translated into algebra or geometry. In 1936, Alan Turing laid down the
basic principles of computation. This was also a time when mathematicians and logicians worked together and gave birth to
the first machines.
In 1956, the expression "artificial intelligence" was born during a conference at Dartmouth College in the United
States. Although computers at the time were being used primarily for what was called scientific computing, researchers
John McCarthy and Marvin Minsky used computers for more than just computing; they had big ambitions with AI. Three
years later, they opened the first AI laboratory at MIT. There was considerable investment, great ambitions, and a lot of
unrealized hope at the time. Among the promises? Build a computer that can mimic the human brain. These promises have
not been kept to this day despite some advances: Garry Kasparov was beaten in chess by the IBM Deep Blue machine, IBM’s
Watson AI system defeated the greatest players in the game Jeopardy!, and AlphaGo beat the greatest players in the board
game Go by learning without human intervention. Demis Hassabis, whose goal was to create the best Go player, created
AlphaGo. We learned that we were very bad players, contrary to what we had thought. The game of Go was considered at
that time to be impregnable. In October 2015, AlphaGo became the first program to defeat a professional (the French player
Fan Hui). In March 2016, AlphaGo beat Lee Sedol, one of the best players in the world (ninth dan professional). In May
2017, it defeated world champion Ke Jie.
These are accomplishments. But there are still important differences between human and machine. A machine today can
perform more than 200 million billion operations per second, and it is progressing. By the time you read this book, this figure
will surely have already been exceeded. On the other hand, if there is a fire in the room, Kasparov will take to his heels while
the machine will continue to play chess! Machines are not aware of themselves, and this is important to mention. AI is a tool
that can help us search for information, identify patterns, and process natural language. Machine learning can also help us reduce bias or detect weak signals. The human component involves common sense, morality, creativity, imag-
ination, compassion, abstraction, dilemmas, dreams, generalization, relationships, friendship, and love.
Machine Learning
In this book, we are not going to philosophize but we will explore how to apply machine learning concretely. Machine
learning is a subfield of AI that aims to understand the structure of data and fit that data into models that we can use
for different applications. Since the optimism of the 1950s, smaller subsets of AI such as machine learning, followed by
deep learning, have created much more concrete applications and larger disruptions in our economies and lives.
Machine learning is a very active field, and some considerations are important to keep in mind. This technology is used
anywhere from automating tasks to providing intelligent insights. It concerns every industry, and you are almost certainly
using AI applications without knowing it. We can make predictions, recognize images and speech, perform medical diag-
noses, devise intelligent supply chains, and much more. In this book, we will explore the common machine learning meth-
ods. The reader is expected to understand basic Python programming and libraries such as NumPy or Pandas. We will study
how to prepare data before feeding the models by showing the math and the code using well-known open-source frame-
works. We will also learn how to run these models on not only classical computers (CPU- or GPU-based) but also quantum
computers. We will also learn the basic mathematical concepts behind machine learning models.
One important step in our journey to AI is how we put the models we have trained into production. The best AI companies
have created data hubs to simplify access to governed, curated, and high-quality data. These data are accessible to any user
who is authorized, regardless of where the data or the user is located. It is a kind of self-service architecture for data con-
sumption. The reason we need to consider the notion of a data hub is that we are in a world of companies that have multiple
public clouds, on-premises environments, private clouds, hybrid clouds, distributed clouds, and other platforms.
Understanding this world is a key differentiator for a data scientist. This is what we call data governance, which is critical to
an organization if they really want to benefit from AI. How much time do we spend retrieving data?
Another important topic regarding AI is how we ensure that the models we develop are trustworthy. As humans and AI
systems are increasingly working together, it is essential that we trust the output of these systems. As scientists or engineers,
we need to work on defining the dimensions of trusted AI, outlining diverse approaches to achieve the different dimensions,
and determining how to integrate them throughout the entire lifecycle of an AI application.
The topic of AI ethics has garnered broad interest from the media, industry, academia, and government. An AI system
itself is not biased per se but simply learns from whatever the data teaches it. As an example of apparent bias, recent research
has shown significantly higher error rates in image classification for dark-skinned women than for men or for people with lighter skin tones.
When we write a line of code, it is our duty and our responsibility to make sure that unwanted bias in datasets and machine
learning models does not appear and is anticipated before putting something into production. We cannot ignore that
machine learning models are being used increasingly to inform high-stakes decisions about real people. Although machine
learning, by its very nature, is always a form of statistical discrimination, the discrimination becomes objectionable when it
places certain privileged groups at a systematic advantage and certain unprivileged groups at a systematic disadvantage. Bias
in training data, due to either prejudice in labels or under- or over-sampling, yields models with undesirable results.
I believe that most people are increasingly interested in rights in the workplace, access to health care and education, and
economic, social, and cultural rights. I am convinced that AI can provide us with the opportunity and the choice to improve
these rights. It will improve the way we perform tasks and allow us to focus on what really matters, such as human relations,
and give us the freedom and time to develop our creativity and somehow, as in the past, have time to reflect. AI is already
improving our customer experience. When we think of customer experience, we can consider one of the most powerful
concepts that Mahatma Gandhi mentioned – the concept of Antyodaya, which means focusing on the benefits for the very
last person in a line or a company. When you have to make choices, you must always ask yourself what impact it has on the
very last person. So, how are our decisions on AI or our lines of code going to affect a young patient in a hospital or a young
girl in school? How will AI affect the end user? The main point is about how technology can improve the customer expe-
rience. AI is a technology that will improve our experiences, and I believe it will help us focus on improving humanity and
give us time to develop our creativity.
AI can truly help us manage knowledge. As an example, roughly 160,000 cancer studies are published every year. The
amount of information available in the world is so vast in quantity that a human cannot process this information. If we take
15 minutes to read a research paper, we will need 40,000 hours a year to read 160,000 research papers; we only have 8,760
hours in a year. Would we not want each of our doctors to be able to take advantage of this knowledge and more to learn
about us and how to help us stay healthy or help us deal with illnesses? Cognitive systems can be trained by top doctors and
read enormous amounts of information such as medical notes, MRIs, and scientific research in seconds and improve
research and development by analyzing millions of papers not only from a specific field but also from all related areas
and new ways to treat patients. We can use and share these trained systems to provide access to care for all populations.
This general introduction has aimed to clarify the potential of AI, but if you are reading these lines, it is certainly because
you are already convinced. The real purpose of this book is to introduce you to the world of machine learning by explaining
the main mathematical concepts and applying them to real-world data. Therefore, we will use Python and the most often-
used open-source libraries. We will learn several concepts such as feature rescaling, feature extraction, and feature selection.
We will explore the different ways to manipulate data such as handling missing data, analyzing categorical data, or proces-
sing time-related data. After the study of the different preprocessing strategies, we will approach the most often-used
machine learning algorithms such as support vector machine or neural networks and see them run on classical (CPU-
and GPU-based) or quantum computers. Finally, an important goal of this book is to apply our models into production
in real life through application programming interfaces (APIs) and containerized applications by using Kubernetes and
OpenShift as well as integration through machine learning operations (MLOps).
I would like to take the opportunity here to say a warm thank you to all the data scientists around the world who have
openly shared their knowledge or code through blogs or data science platforms, as well as the open-source community that
has allowed us to improve our knowledge in the field. We are grateful to have all these communities – there is always some-
one who has written about something we need or face.
This book was written with the idea to always have nearby a book that I can open when I need to refresh some machine
learning concepts and reuse some code. All code is available online and the links are provided.
I hope you will enjoy reading this book, and any feedback to improve it is most welcome!
1 Concepts, Libraries, and Essential Tools in Machine Learning and Deep Learning
In this first chapter, we will explore the different concepts in statistical learning as well as popular open-source libraries and
tools. This chapter will serve as an introduction to the field of machine learning for those with a basic mathematical
background and software development skills. In general, machine learning is used to understand the structure of the data
we have at our disposal and fit that data into models to be used for anything from automating tasks to providing intelligent
insights to predicting a behavior. As we will see, machine learning differs from traditional computational approaches, as we
will train our algorithms on our data as opposed to explicitly coded instructions and we will use the output to automate
decision-making processes based on the data we have provided.
Machine learning, deep learning, and neural networks are branches of artificial intelligence (AI) and computer science. More precisely, neural networks are a subfield of machine learning, and deep learning is a subfield of neural networks. Deep learning automates much of the feature extraction, which enables the use of very large datasets; compared with classical machine learning, it requires less human intervention to process data.
Exploration and iterations are necessary for machine learning, and the process is composed of different steps. A typical
machine learning workflow is summarized in Figure 1.1.
In this chapter, we will see concepts that are applied in machine learning, including unsupervised methods such as clustering to group unlabeled data (e.g., K-means) and dimensionality reduction to reduce the number of features in order to better summarize and visualize the data (e.g., principal component analysis); feature extraction to define attributes in image and text data; feature selection to identify meaningful features to create better supervised models; cross-validation to estimate the performance of models on new data; and ensemble methods to combine the predictions of multiple models.
Figure 1.1 A typical machine learning workflow (an iterative process starting from data management).
1.1 Learning Styles for Machine Learning
The literature describes different types of machine learning algorithms, classified into categories. Depending on the way we provide information to the learning system and on whether we provide feedback on the learning, these types fall into the categories of supervised learning, unsupervised learning, or reinforcement learning.
1.1.1 Supervised Learning
In supervised learning, the model learns from labeled data in order to predict a target variable. Typical supervised problems include the following:
• Binary classification: The algorithm will classify the data into two categories such as spam or not spam. The label space
is {0, 1} or {−1, 1}.
• Multi-class classification: The algorithm needs to choose among more than two types of answers for a target variable
such as the recognition of images containing animals (e.g., dog = 1, cat = 2, fish = 3, etc.). If we have N image classes, we
have C = {1, 2, …, N}.
• Regression: Regression models predict continuous variables as opposed to classification models, which consider cate-
gorical variables. For example, if we attempt to predict measures such as temperature, net profit, or the weight of a person,
this will require regression models. Here, C = ℝ.
In supervised learning, we can find popular classification models such as decision trees, support vector machine, naïve
Bayes classifiers, random forest, or neural networks. We can also find popular regression models such as linear regression,
ridge regression, ordinary least squares regression, or stepwise regression.
As mentioned above, we need to find a function f : ℝᵈ → C. This requires some steps such as making some assumptions regarding what the function f looks like and what space of functions we will be using (linear, decision trees, polynomials, etc.). This is what we call the hypothesis space ℋ. This choice is very important because it impacts how our model will generalize to completely new data that has not been used for training.
Figure 1.2 Underfitting, optimal fitting, and overfitting in classification, regression, and deep learning.
When training a model, we need to find the right balance between underfitting, or bias, and overfitting, or variance. Bias means a prediction error that is introduced in the algorithm
due to oversimplification or the differences between the predicted values and the actual values. Variance occurs when the
model performs well with training data but not with test data. We should also introduce two other words: signal and noise.
Signal refers to the true underlying pattern of the data that allows the algorithm to learn from data, whereas noise is irrel-
evant and unnecessary data that reduces the performance of the algorithm.
In overfitting, the machine learning model will attempt to cover more than the necessary data points present in a dataset
or simply all data points. It is a modeling error in statistics because the function is too closely aligned to the training dataset
and will certainly translate to a reduction in efficiency and accuracy with new data because the model has considered
inaccurate values in the dataset. Models that are overfitted have low bias and high variance. For example, overfitting
can come when we train our model excessively. To reduce overfitting, we can perform cross-validation, regularization,
or ensembling, train our model with more data, remove unnecessary features, or stop the training of the model earlier.
We will see all these techniques.
At the opposite extreme, we have underfitting, with high bias and low variance, which occurs when our model is not able to capture the underlying trend of the data, generating a high error rate on both the training dataset and new data. An underfitted model cannot capture the relationships between input and output variables accurately and produces unreliable predictions; in other words, it does not generalize well to new data. To avoid underfitting, we can decrease the regularization used to reduce the variance, increase the duration of the training, or perform feature selection.
Some models are more prone to overfitting than others, such as KNN or decision trees. The goal of machine learning is to achieve "goodness of fit," a term from statistics describing how closely the predicted values match the true values of the dataset. As we can imagine, the ideal fit of our model lies between underfitting and overfitting. As we will see during our journey to mastering machine learning, this goal is difficult to achieve. In addition, overfitting can be more difficult to identify than underfitting because an overfitted model performs with high accuracy on its training data. We can assess the accuracy of our model by using a method called k-fold cross-validation.
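As a brief illustrative sketch (not part of the original text; it assumes scikit-learn and uses one of its built-in datasets), k-fold cross-validation with k = 5 can be run as follows:

# Illustrative k-fold cross-validation sketch (assumes scikit-learn is installed).
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)          # example dataset shipped with scikit-learn
model = LogisticRegression(max_iter=1000)  # any classifier could be used here

# Evaluate the model on k = 5 folds: each fold is used once as a validation set
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print("Accuracy per fold:", scores)
print("Mean accuracy:", scores.mean())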
Figure 1.3 Dataset divided into five subgroups for fivefold cross-validation.
1.1.1.3 Train/Test Split
In practice, the data are typically divided into a training dataset, a validation dataset, and a test dataset. The training dataset is used to fit the model, and the validation dataset is used to evaluate the model while tuning its configuration. The more the validation dataset is incorporated into the model configuration, the more the evaluation becomes biased. The test dataset is used to provide an unbiased evaluation of a final model; it is only used when the model has been completely trained using the training and validation datasets. It is important to carefully curate the test dataset in order to represent the real world.
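As a brief illustrative sketch (not part of the original text; it assumes scikit-learn and one of its built-in datasets), a hold-out split can be created with train_test_split:

# Illustrative hold-out split (assumes scikit-learn is installed).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Keep 20% of the data aside as a test set; stratify to preserve class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)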
1.1.1.4 Confusion Matrix
The confusion matrix is another important concept in machine learning used to determine the performance of a classification model for a given set of test data. It shows the errors in the form of an N × N matrix (error matrix), where N is the number of target classes. It compares the actual target values with those predicted by the machine learning algorithm. For example, for a binary classification problem, we will have a 2 × 2 matrix; for three classes, we will have a 3 × 3 table; and so on.
In Figure 1.4, the matrix contains two dimensions, representing predicted values and actual values, along with the total number of predictions. The target variable has two values (positive or negative); the columns represent the actual values and the rows the predicted values of the target variable. Inside the matrix, we have the following possibilities:
• True positive (TP): The actual value was positive, and the model predicted a positive value.
• True negative (TN): The actual value was negative, and the model predicted a negative value.
• False positive (FP): The actual value was negative, but the model predicted a positive value (the value was falsely predicted as positive, also known as a type 1 error).
• False negative (FN): The actual value was positive, but the model predicted a negative value (the value was falsely predicted as negative, also known as a type 2 error).
Figure 1.4 Confusion matrix (rows: predicted values, positive/negative; columns: actual values, positive/negative; cells: TP, FP, FN, TN).
Let us take an example (Figure 1.5) with a classification dataset of 1000 data points and a fitted classifier that has produced the confusion matrix shown. In Figure 1.5, we see that 560 positive-class data points were correctly classified by the model, 330 negative-class data points were correctly classified, 60 negative-class data points were incorrectly classified as belonging to the positive class, and 50 positive-class data points were incorrectly classified as belonging to the negative class. We can conclude that the classifier is acceptable.
Figure 1.5 An example of a confusion matrix (TP = 560, FP = 60, FN = 50, TN = 330).
To make things more concrete visually, let us say we have a set of 10 people and we want to build a model to predict whether they are sick. Depending on the actual and predicted values, the outcome can be TP, TN, FP, or FN:

ID   Actual sick   Predicted sick   Confusion matrix
1    1             1                TP
2    1             1                TP
3    1             0                FN
4    0             1                FP
5    0             1                FP
6    0             0                TN
7    1             1                TP
8    0             0                TN
9    1             0                FN
10   0             1                FP
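As a small illustrative sketch (not part of the original text; it assumes scikit-learn), the confusion matrix for these 10 hypothetical predictions can be computed directly:

# Illustrative sketch: compute the confusion matrix for the 10 hypothetical people above.
from sklearn.metrics import confusion_matrix

y_actual    = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]  # actual "sick" labels
y_predicted = [1, 1, 0, 1, 1, 0, 1, 0, 0, 1]  # model predictions

# With labels=[1, 0], rows are actual classes and columns are predicted classes,
# positive first, so the result is [[TP, FN], [FP, TN]]
print(confusion_matrix(y_actual, y_predicted, labels=[1, 0]))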
With the help of a confusion matrix, it is possible to calculate a measure of performance such as accuracy, precision, recall,
or others.
Classification Accuracy Classification accuracy is one of the most important metrics for assessing a classification model. It simply indicates how often the model predicts the correct output. It is calculated as the ratio of the number of correct predictions to the total number of predictions made by the classifier:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Misclassification Rate The misclassification rate, also known as error rate, defines how often the model provides incorrect predictions. To calculate the error rate, we divide the number of incorrect predictions by the total number of predictions made by the classification model:
Error rate = (FP + FN) / (TP + TN + FP + FN)
Precision and Recall Precision indicates how many of the predicted positive cases were truly positive; it is a way to estimate whether our model is reliable:
Precision = TP / (TP + FP)
Recall indicates how many of the actual positive cases the model predicted correctly:
Recall = TP / (TP + FN)
F-Score The traditional F-measure or balanced F-score (F1 score) is used to evaluate recall and precision at the same time; it is the harmonic mean of precision and recall. The F-score is highest when precision and recall are equal:
F1 = 2 / (recall⁻¹ + precision⁻¹) = 2 × (precision × recall) / (precision + recall)
The F-score is widely used in the natural language processing literature (named entity recognition, word segmentation, etc.).
The metrics described above are the most widely used ones. There are a number of other important metrics that we can
explore to fit our context, including the Fowlkes–Mallows index, Matthews correlation coefficient, Jaccard index, diagnostic
odds ratio, and others. We can find all these metrics in the literature regarding diagnostic testing in machine learning.
To provide a simple example of code, let us say that we wish to build a supervised machine learning model called linear
discriminant analysis (LDA) and print the classification accuracy, precision, recall, and F1 score. Before doing this, we have
previously split the dataset into training data (X_train and y_train) and testing data (X_test and y_test). We want to apply
cross-validation (k = 5) and print the scores. To run this code, we need to install scikit-learn, as described later in this
chapter.
Input:
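(The original listing is not reproduced in this extraction; the following is a minimal sketch consistent with the description above, assuming scikit-learn and the previously created X_train, y_train, X_test, and y_test.)

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import cross_val_score

# Fit a linear discriminant analysis (LDA) classifier on the training data
lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)

# Evaluate on the held-out test data
y_pred = lda.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred, average="weighted"))
print("Recall:", recall_score(y_test, y_pred, average="weighted"))
print("F1 score:", f1_score(y_test, y_pred, average="weighted"))

# Apply k-fold cross-validation (k = 5) on the training data and print the scores
scores = cross_val_score(lda, X_train, y_train, cv=5)
print("Cross-validation scores:", scores)
print("Mean cross-validation score:", scores.mean())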
1.1.1.5 Loss Functions
Training a supervised model consists of searching, within the hypothesis space, for the hypothesis h that minimizes the average loss over the n training examples:
arg min_h L(h) = arg min_h (1/n) Σ_{i=1}^{n} ℓ(x_i, y_i | h)
where we use the loss function L of a hypothesis h given D and l for the loss of a single data pair (xi, y) given h. There are
different types of loss functions such as squared error loss, absolute error loss, Huber loss, or hinge loss. For clarity, it is also
important to mention the cost function, which is the average loss over the entire training dataset; in contrast, the loss func-
tion is computed for a single training example. The loss function will have a higher value if the predictions do not appear
accurate and a lower value if the model performs fairly well. The cost function quantifies the error between predicted values
and expected values and presents it in the form of a single number. The purpose of the cost function is to be either minimized
(cost, loss, or error) or maximized (reward). Cost and loss refer almost to the same concept, but the cost function invokes a
penalty over a number of training examples or the complete batch, whereas the loss function mainly applies to a single training example.
The cost function is computed as an average of the loss functions and is calculated once; in contrast, the loss function is
calculated at every instance.
For example, in linear regression, the loss function used is the squared error loss and the cost function is the mean squared
error (MSE). The squared error loss, also known as L2 loss, is the square of the difference between the actual and the
predicted values:
L = (y − f(x))²
The corresponding cost function is the mean of these squared errors. The overall loss is then:
MSE = (1/n) Σ_{i=1}^{n} (y_i − f(x_i))²
Another regression loss function is the absolute error loss, also known as the L1 loss, which is the difference between the
predicted and the actual values, irrespective of the sign:
L = |y − f(x)|
The cost is the mean of the absolute errors (MAE), which is more robust than MSE regarding outliers.
Another example is the Huber loss, which combines the MSE and MAE by taking a quadratic form for smaller errors and a
linear form otherwise:
Huber = ½ (y − f(x))², if |y − f(x)| ≤ δ
Huber = δ |y − f(x)| − ½ δ², otherwise
In the formula above, Huber loss is defined by a δ parameter; it is usually used in robust regression or m-estimation and is
more robust to outliers than MSE.
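As a small illustrative sketch (not part of the original text; it assumes NumPy and uses made-up values), these regression losses can be computed directly:

# Illustrative sketch: computing MSE, MAE, and Huber loss with NumPy (made-up values).
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])   # hypothetical targets
y_pred = np.array([2.5,  0.0, 2.0, 8.0])   # hypothetical predictions
residual = y_true - y_pred

mse = np.mean(residual ** 2)                # mean squared error (L2)
mae = np.mean(np.abs(residual))             # mean absolute error (L1)

delta = 1.0                                 # Huber threshold parameter
huber = np.mean(np.where(np.abs(residual) <= delta,
                         0.5 * residual ** 2,
                         delta * np.abs(residual) - 0.5 * delta ** 2))

print(mse, mae, huber)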
For binary classification, we can use the binary cross-entropy loss (also called log loss), which is based on the entropy of a random variable X with probability distribution p(X):
H(X) = − Σ_x p(x) log p(x)
If there is greater uncertainty in the distribution, the entropy for the probability distribution will be greater. For a true label y ∈ {0, 1} and a predicted probability ŷ, the binary cross-entropy loss is L = − (y log ŷ + (1 − y) log(1 − ŷ)).
The hinge loss function is very popular in support vector machines (SVMs) with class labels 1 and −1:
L = max(0, 1 − y · f(x))
Finally, for multi-class classification, we can use the categorical cross-entropy loss, also called softmax loss. It is a
softmax activation plus a cross-entropy loss. In a multi-label classification problem, the target represents multiple classes
at once. In this case, we calculate the binary cross-entropy loss for each class separately and then sum them up for the
complete loss.
We can find numerous methodologies that we can explore to better optimize our models.
Figure: families of machine learning methods (classification, clustering, dimensionality reduction, and model-free approaches).
1.2 Essential Python Tools for Machine Learning
I will not present very much about Python (https://www.python.org) itself in this section, but I must say that the language of
data scientists is indeed Python! This platform is widely used for machine learning applications and is clearly a common
choice across academia and industry. Python is easy to learn and to read, and its syntax is accessible to beginners. What
also makes Python powerful is its huge community of developers and data scientists who make Python easier for beginners
by sharing open-source projects, libraries, tutorials, and other examples. I am grateful to this community, which has helped
me progress.
1.2.2.1 Scikit-learn
Scikit-learn (https://scikit-learn.org) can be installed with pip (pip install scikit-learn) and imported in Python as follows:

import sklearn
We will see in Chapter 2 how to use sklearn. One of the great things about scikit-learn is that it is very well documented,
with many code examples. Python is the interface, but C libraries are leveraged for performance such as NumPy for arrays
and matrix operations. In addition, scikit-learn provides several in-built datasets that are easy to import.
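As a quick illustration (a sketch, not from the original text; it assumes scikit-learn is installed), one of these built-in datasets can be loaded in a couple of lines:

# Illustrative sketch: load one of scikit-learn's built-in datasets.
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
print(X.shape, y.shape)  # (569, 30) (569,)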
1.2.2.2 TensorFlow
TensorFlow (https://www.tensorflow.org) is a Python library created by Google in late 2015. At first, Google used it for
internal needs and open-sourced it (Apache open-source license) to the community. Since then, TensorFlow has become
one of the largest Python libraries in the open-source machine learning domain. We all know Gmail, Google Search, and
YouTube; these applications utilize TensorFlow. One of the advantages of TensorFlow is that it examines data in tensors,
which are multi-dimensional arrays that can process large amounts of data. All library actions are performed within a graph
(graph-based architecture) made of a sequence of computations, all interconnected. Another advantage is that TensorFlow
is designed to run on numerous CPUs, GPUs, or mobile devices. It is easy to execute the code in a distributed way across
clusters and to use GPUs to dramatically improve the performance of training. The graphs are portable, allowing saving of
computations and executing them at our convenience. We can divide the operations of TensorFlow into three parts: prepar-
ing the data, model creation, and training and estimation of the model.
TensorFlow includes different models for computer vision (image or video classification, object detection, and segmen-
tation), natural language processing, or recommendations. To perform these operations, TensorFlow uses models such as
deep residual learning for image recognition (ResNet), Mask-CNN, ALBERT (a light Bidirectional Encoder Representations
from Transformers [BERT] for self-supervised learning of language representations), or neural collaborative filtering.
Installation of TensorFlow is straightforward (with pip):
https://www.tensorflow.org/install/pip#system-install
As mentioned above, one of the main reasons to run TensorFlow is the use of GPUs. On a computer or cluster running
with GPUs, it will be easy to run algorithms using TensorFlow. It will be necessary to install a CUDA toolkit and libraries
(cuDNN libraries) that need to interact with GPUs.
Let us view an example by installing cuDNN and the CUDA toolkit on Ubuntu 20.04. To install the CUDA toolkit, we can
apply the following instructions in a terminal:
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin
sudo mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/11.3.0/local_installers/cuda-repo-ubuntu2004-11-3-local_11.3.0-465.19.01-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu2004-11-3-local_11.3.0-465.19.01-1_amd64.deb
We can now check whether nvidia-persistenced is running as a daemon from system initialization:
cat /proc/driver/nvidia/version
To install the cuDNN libraries, we need to download the versions that correspond to our environment (https://developer.nvidia.com/cudnn). For a local computer running Ubuntu 20.04, they are the following:
• cuDNN code samples and user guide for Ubuntu 20.04 x86_64 (Deb).
• libcudnn8-samples_8.2.0.53-1+cuda11.3_amd64.deb.
• libcudnn8-dev_8.2.0.53-1+cuda11.3_amd64.deb.
• libcudnn8_8.2.0.53-1+cuda11.3_amd64.deb.
Then, we can run the command below in our terminal to check that the NVIDIA driver is correctly installed and communicating with the GPU:
nvidia-smi
To test the installation, we simply type the following commands in our terminal; we should see “Test passed”:
cp -r /usr/src/cudnn_samples_v8/ $HOME
cd $HOME/cudnn_samples_v8/mnistCUDNN/
make
./mnistCUDNN
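Once the installation passes, we can also verify from Python that TensorFlow itself detects the GPU (an illustrative check, assuming TensorFlow is installed):

# Illustrative check: list the devices visible to TensorFlow.
import tensorflow as tf

print("TensorFlow version:", tf.__version__)
print("GPUs visible:", tf.config.list_physical_devices("GPU"))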
1.2.2.3 Keras
Keras (https://keras.io) is a Python framework that provides a library with an emphasis on deep learning applications. Keras can run on top of TensorFlow or Theano (a Python library and optimizing compiler for manipulating and evaluating mathematical expressions) and enables fast experimentation through a user-friendly interface. It runs on both CPUs and
GPUs. Keras is mainly used for computer vision using convolutional neural networks or for sequence and time series using
recurrent neural networks. We can also use Keras applications, in which we can find deep learning models that are pre-
trained. The models can be applied for feature extraction, fine-tuning, or prediction. The framework also provides different
datasets such as the MNIST dataset containing 70,000 28 × 28 grayscale images with 10 different classes that we can load
directly using Keras.
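As a short illustrative sketch (not part of the original text; it assumes TensorFlow with Keras is installed), the MNIST dataset can be loaded directly:

# Illustrative sketch: load the MNIST dataset that ships with Keras.
from tensorflow import keras

(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
print(x_train.shape, x_test.shape)  # (60000, 28, 28) (10000, 28, 28)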
1.2.2.4 PyTorch
PyTorch (https://pytorch.org), released to open-source in 2017, is a Python machine learning library based on the Torch
machine learning library with additional features and functionalities, making the deployment of machine learning models
faster and easier. In other words, it combines the GPU-accelerated backend libraries from Torch with a user-friendly Python
frontend. PyTorch is written in Python and integrated with libraries such as NumPy, SciPy, and Cython for better
performance.
PyTorch contains many features for data analysis and preprocessing. Facebook has developed a machine learning library
called Caffe2 that has been merged into PyTorch; thus, Caffe2 is now part of PyTorch. In addition, PyTorch has an
ecosystem of libraries such as skorch (sklearn compatibility), Captum (model interpretability), or Glow, a compiler and
execution engine for hardware accelerators.
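As a minimal illustrative sketch (not part of the original text; it assumes PyTorch is installed), tensors can be created and moved to a GPU when one is available:

# Illustrative sketch: basic PyTorch tensors, optionally on GPU.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(3, 4, device=device)   # random 3x4 tensor
w = torch.randn(4, 2, device=device)
print((x @ w).shape, "on", device)     # matrix product -> torch.Size([3, 2])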
Of course, there are many more machine learning frameworks in the open-source world such as OpenCV or R; we would
need more than a book to describe all of them. But with scikit-learn, TensorFlow, Keras, and PyTorch, much can already be
done! Beyond the open-source frameworks, it is also important to consider how we can improve AI lifecycle management,
unite different teams around the data, and accelerate the time to value creation.
1.2.3 Jupyter Notebook and JupyterLab
We can access a Notebook running on a remote machine over SSH by setting up an SSH tunnel:
# Replace <PORT> with the port number you selected in the above step
# Replace <REMOTE_USER> with the remote server username
# Replace <REMOTE_HOST> with your remote server address
ssh -L 8080:localhost:<PORT> <REMOTE_USER>@<REMOTE_HOST>
Finally, we can open a browser and navigate, thanks to the links provided in the terminal.
1.3 HephAIstos for Running Machine Learning on CPUs, GPUs, and QPUs
In this book, we introduce hephAIstos, an open-source Python framework designed to execute machine learning pipelines
on CPUs, GPUs, and quantum processing units (QPUs). The framework incorporates various libraries, including scikit-
learn, Keras with TensorFlow, and Qiskit, as well as custom code.
You may choose to skip this section if you prefer diving directly into the concepts of machine learning or if you already
have ideas you would like to experiment with. Throughout the book, we will integrate hephAIstos code examples to dem-
onstrate the concepts discussed.
Our aim is to simplify the application of the techniques explored in this book. You can either learn to create your own
pipelines or utilize hephAIstos to streamline the process. Feel free to explore this section now if you already possess knowl-
edge of machine learning and Python. Alternatively, you can return to it later, as we will incorporate hephAIstos pipelines in
various examples throughout the book. HephAIstos is distributed under the Apache License, Version 2.0.
We encourage contributions to enhance the framework. You can create pipelines using Python functions with parameters
or employ specific routines.
Find hephAIstos on GitHub: https://github.com/xaviervasques/hephaistos.git
1.3.1 Installation
To install hephAIstos, you can clone it from GitHub. You can either download the repository directly or use the following
command in your terminal:
git clone https://github.com/xaviervasques/hephaistos.git

Next, navigate to the hephAIstos directory and install the required dependencies:
• Python
• joblib
• numpy
• scipy
• pandas
• scikit-learn
• category_encoders
• hashlib
• matplotlib
• tensorflow
• qiskit
• qiskit_machine_learning
The package includes several datasets for your convenience. Throughout this book, we will delve into various machine
learning methods and provide code snippets to demonstrate how hephAIstos operates. The framework supports an array of
techniques, as listed below:
• Feature rescaling: StandardScaler, MinMaxScaler, MaxAbsScaler, RobustScaler, unit vector normalization, log
transformation, square root transformation, reciprocal transformation, Box-Cox, Yeo-Johnson, quantile Gaussian, and
quantile uniform.
• Categorical data encoding: ordinal encoding, one hot encoding, label encoding, Helmert encoding, binary encoding,
frequency encoding, mean encoding, sum encoding, weight of evidence encoding, probability ratio encoding, hashing
encoding, backward difference encoding, leave one out encoding, James–Stein encoding, and M-estimator.
• Time-related feature engineering: time split (year, month, seconds, etc.), lag, rolling window, and expanding window.
• Missing values: row/column removal, statistical imputation (mean, median, mode), linear interpolation, multivariate imputation by chained equation (MICE), and KNN imputation.
• Feature extraction: principal component analysis, independent component analysis, linear discriminant analysis,
locally linear embedding, t-distributed stochastic neighbor embedding, and manifold learning techniques.
• Feature selection: filter methods (variance threshold, statistical tests, chi-square test, ANOVA F-value, Pearson corre-
lation coefficient), wrapper methods (forward stepwise selection, backward elimination, exhaustive feature selection),
and embedded methods (least absolute shrinkage and selection operator, ridge regression, elastic net, regularization
embedded into ML algorithms, tree-based feature importance, permutation feature importance).
• Classification algorithms running on CPUs: support vector machine with linear, radial basis function, sigmoid and
polynomial kernel functions (svm_linear, svm_rbf, svm_sigmoid, svm_poly), multinomial logistic regression
(logistic_regression), linear discriminant analysis (lda), quadratic discriminant analysis (qda), Gaussian naive Bayes
(gnb), multinomial naive Bayes (mnb), k-nearest neighbors (kneighbors), stochastic gradient descent (sgd),
nearest centroid classifier (nearest_centroid), decision tree classifier (decision_tree), random forest classifier
(random_forest), extremely randomized trees (extra_trees), multi-layer perceptron classifier (mlp_neural_network),
and a multi-layer perceptron classifier that automatically runs different hyperparameter combinations and returns the best result (mlp_neural_network_auto).
• Regression algorithms running on CPUs: linear regression (linear_regression), SVR with linear kernel (svr_linear),
SVR with radial basis function (RBF) kernel (svr_rbf), SVR with sigmoid kernel (svr_sigmoid), SVR with polynomial kernel (svr_poly), multi-layer perceptron for regression (mlp_regression), and a multi-layer perceptron that automatically runs different hyperparameter combinations and returns the best result (mlp_auto_regression).
• Save results
– output_folder: To save figures, results, or inference models, set the path of the output folder where you want the results of the pipeline (such as the accuracy metrics of the models) to be saved in a .csv file.
– Let us take an example. Create a Python file in hephAIstos, write the following lines, and execute it:
# Import dataset
from data.datasets import neurons
df = neurons() # Load the neurons dataset
# Run ML Pipeline
ml_pipeline_function(df, output_folder='./Outputs/')
# Execute the ML pipeline with the loaded dataset and store the output in the './Outputs/' folder
# Import dataset
from data.datasets import neurons
df = neurons() # Load the neurons dataset
# Run ML Pipeline with a 20% test set size
ml_pipeline_function(df, output_folder='./Outputs/', test_size=0.2)
# Execute the ML pipeline with the loaded dataset, split the data into train and test
# sets with a test size of 20%, and store the output in the './Outputs/' folder
– test_time_size: If the dataset is a time series dataset, we do not use test_size but test_time_size instead. If we choose
test_time_size = 1000, it will take the last 1000 values of the dataset for testing.
– time_feature_name: This is the name of the feature containing the time series.
– time_split: This is used to split the time variable by year, month, minutes, and seconds as described in
https://pandas.pydata.org/docs/reference/api/pandas.Series.dt.year.html.
Available options are “year,” “month,” “hour,” “minute,” “second.”
– time_format: This is the strftime to parse time, for example, “%d/%m/%Y.” See the strftime documentation for more
information and the different options:
https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior.
For example, if the data are “1981-7-1 15:44:31,” a format would be “%Y-%d-%m %H:%M:%S.”
– Example:
# Import Dataset
from data.datasets import DailyDelhiClimateTrain
df = DailyDelhiClimateTrain() # Load the DailyDelhiClimateTrain dataset
df = df.rename(columns={"meantemp": "Target"}) # Rename the 'meantemp' column to 'Target'
# Run ML Pipeline
ml_pipeline_function(
df,
output_folder='./Outputs/', # Store the output in the './Outputs/' folder
missing_method='row_removal', # Remove rows with missing values
test_time_size=365, # Set the test time size to 365 days
time_feature_name='date', # Set the time feature name to 'date'
time_format="%Y-%m-%d", # Define the time format as "%Y-%m-%d"
time_split=['year', 'month', 'day'] # Split the time feature into 'year', 'month', and 'day' columns
)
# Execute the ML pipeline with the loaded dataset, removing rows with missing values,
# using a test set of 365 days and splitting the 'date' feature into 'year', 'month',
# and 'day' columns.
– Example:
# Run ML Pipeline
ml_pipeline_function(
df,
output_folder='./Outputs/', # Store the output in the './Outputs/' folder
missing_method='row_removal', # Remove rows with missing values
test_time_size=365, # Set the test time size to 365 days
time_feature_name='date', # Set the time feature name to 'date'
time_format="%Y-%m-%d", # Define the time format as "%Y-%m-%d"
time_split=['year', 'month', 'day'], # Split the time feature into 'year', 'month', and 'day' columns
time_transformation='lag', # Apply the lag transformation
number_of_lags=2, # Set the number of lags to 2
lagged_features=['wind_speed', 'meanpressure'], # Set the lagged features to 'wind_speed' and 'meanpressure'
lag_aggregation=['min', 'mean'] # Set the aggregation methods for the lagged features to 'min' and 'mean'
)
# Execute the ML pipeline with the loaded dataset, removing rows with missing values,
# using a test set of 365 days, splitting the 'date' feature into 'year', 'month', and 'day' columns,
# applying the lag transformation with 2 lags for 'wind_speed' and 'meanpressure', and
# aggregating them with 'min' and 'mean'.
• Categorical data
– categorical: If the dataset is composed of categorical data that are labeled with text, we can select data encoding methods.
The following options are available: “ordinal_encoding,” “one_hot_encoding,” “label_encoding,” “helmert_encoding,”
“binary_encoding,” “frequency_encoding,” “mean_encoding,” “sum_encoding,” “weightofevidence_encoding,”
“probability_ratio_encoding,” “hashing_encoding,” “backward_difference_encoding,” “leave_one_out_encoding,”
“james_stein_encoding,” and “m_estimator_encoding.” Different encoding methods can be combined.
– We need to select the features that we want to encode with the specified method. For this, we indicate the features we
want to encode for each method:
◦ features_ordinal
◦ features_one_hot
◦ features_label
◦ features_helmert
◦ features_binary
◦ features_frequency
◦ features_mean
◦ features_sum
◦ features_weight
◦ features_proba_ratio
◦ features_hashing
◦ features_backward
◦ features_leave_one_out
◦ features_james_stein
◦ features_m
– Example:
# Run ML Pipeline
ml_pipeline_function(
df,
output_folder='./Outputs/', # Store the output in the './Outputs/' folder
missing_method='row_removal', # Remove rows with missing values
test_size=0.2, # Set the test set size to 20% of the dataset
categorical=['binary_encoding', 'label_encoding'], # Apply binary and label encoding for categorical features
features_binary=['smoker', 'sex'], # Apply binary encoding for 'smoker' and 'sex' features
features_label=['region'] # Apply label encoding for the 'region' feature
)
# Execute the ML pipeline with the loaded dataset, removing rows with missing values,
# using a test set of 20%, and applying binary encoding for 'smoker' and 'sex'
# features and label encoding for the 'region' feature.
• Data rescaling
– rescaling: We can include a data rescaling method. The following options are available:
◦ standard_scaler
◦ minmax_scaler
◦ maxabs_scaler
◦ robust_scaler
◦ normalizer
◦ log_transformation
◦ square_root_transformation
◦ reciprocal_transformation
◦ box_cox
◦ yeo_johnson
◦ quantile_gaussian
◦ quantile_uniform
– Example:
# Import Data
from data.datasets import neurons
df = neurons() # Load the neurons dataset
# Run ML Pipeline
ml_pipeline_function(
df,
output_folder='./Outputs/', # Store the output in the './Outputs/' folder
missing_method='row_removal', # Remove rows with missing values
test_size=0.2, # Set the test set size to 20% of the dataset
categorical=['label_encoding'], # Apply label encoding for categorical features
features_label=['Target'], # Apply label encoding for the 'Target' feature
rescaling='standard_scaler' # Apply standard scaling to the features
)
# Execute the ML pipeline with the loaded dataset, removing rows with missing values,
# using a test set of 20%, applying label encoding for the 'Target' feature, and
# rescaling the features using the standard scaler.
• Feature extraction
– feature_extraction: This option selects the method of feature extraction. The following choices are available:
◦ pca
◦ ica
◦ icawithpca
◦ lda_extraction
◦ random_projection
◦ truncatedSVD
◦ isomap
◦ standard_lle
◦ modified_lle
◦ hessian_lle
◦ ltsa_lle
◦ mds
◦ spectral
◦ tsne
◦ nca
– number_components: This is the number of principal components we want to keep for PCA, ICA, LDA, or other
purposes.
– n_neighbors: This is the number of neighbors to consider for manifold learning techniques.
– Example:
# Import Data
from data.datasets import neurons
df = neurons() # Load the neurons dataset
# Run ML Pipeline
ml_pipeline_function(
df,
output_folder='./Outputs/', # Store the output in the './Outputs/' folder
missing_method='row_removal', # Remove rows with missing values
test_size=0.2, # Set the test set size to 20% of the dataset
categorical=['label_encoding'], # Apply label encoding for categorical features
features_label=['Target'], # Apply label encoding for the 'Target' feature
rescaling='standard_scaler', # Apply standard scaling to the features
features_extraction='pca', # Apply Principal Component Analysis for feature extraction
number_components=2 # Keep 2 principal components (illustrative values)
)
• Feature selection
– feature_selection: Here we can select a feature selection method (filter, wrapper, or embedded):
◦ The following filter options are available:
▪ variance_threshold: Apply a variance threshold. If we choose this option, we also need to indicate the features we
want to process (features_to_process= [‘feature_1’, ‘feature_2’, …]) and the threshold (var_threshold = 0 or any
number).
▪ chi2: Perform a chi-squared test on the samples and retrieve only the k-best features. We can define k with the
k_features parameter.
▪ anova_f_c: Create a SelectKBest object to select features with the k-best ANOVA F-values for classification. We
can define k with the k_features parameter.
▪ anova_f_r: Create a SelectKBest object to select features with the k-best ANOVA F-values for regression. We can
define k with the k_features parameter.
▪ pearson: The main idea for feature selection is to retain the variables that are highly correlated with the target and
keep features that are uncorrelated among themselves. The Pearson correlation coefficient between features is
defined by cc_features and that between features and the target by cc_target.
– Examples:
# Import Data
from data.datasets import neurons
df = neurons() # Load the neurons dataset
# Run ML Pipeline
ml_pipeline_function(
df,
output_folder='./Outputs/', # Store the output in the './Outputs/' folder
missing_method='row_removal', # Remove rows with missing values
test_size=0.2, # Set the test set size to 20% of the dataset
categorical=['label_encoding'], # Apply label encoding for categorical features
features_label=['Target'], # Apply label encoding for the 'Target' feature
rescaling='standard_scaler', # Apply standard scaling to the features
feature_selection='pearson', # Apply Pearson correlation-based feature selection
cc_features=0.7, # Set the correlation coefficient threshold for pairwise feature correlation to 0.7
cc_target=0.7 # Set the correlation coefficient threshold for correlation with the target variable to 0.7
)
# Execute the ML pipeline with the loaded dataset, removing rows with missing values,
# using a test set of 20%, applying label encoding for the 'Target' feature, rescaling
# the features using the standard scaler,
# and performing feature selection based on Pearson correlation with thresholds of 0.7
# for pairwise feature correlation and correlation with the target variable.
or
# Import Data
from data.datasets import neurons
df = neurons() # Load the neurons dataset
# Run ML Pipeline
ml_pipeline_function(
df,
output_folder='./Outputs/', # Store the output in the './Outputs/' folder
missing_method='row_removal', # Remove rows with missing values
test_size=0.2, # Set the test set size to 20% of the dataset
categorical=['label_encoding'], # Apply label encoding for categorical features
features_label=['Target'], # Apply label encoding for the 'Target' feature
rescaling='standard_scaler', # Apply standard scaling to the features
feature_selection='anova_f_c', # Apply ANOVA F-test based feature selection
k_features=2 # Select the top 2 features based on their F-scores
)
# Execute the ML pipeline with the loaded dataset, removing rows with missing values,
# using a test set of 20%, applying label encoding for the 'Target' feature, rescaling
# the features using the standard scaler,
# and performing feature selection based on ANOVA F-test, selecting the top 2 features.
◦ Wrapper methods: The following options are available for feature_selection: “forward_stepwise,”
“backward_elimination,” and “exhaustive.”
▪ wrapper_classifier: In wrapper methods, we need to select a classifier or regressor. Here, we can choose one from
scikit-learn, such as KNeighborsClassifier(), RandomForestClassifier(), LinearRegression(), or others, and apply it to
forward stepwise (forward_stepwise), backward elimination (backward_elimination), or exhaustive (exhaustive)
methods.
▪ min_features and max_features are attributes for exhaustive option to specify the minimum and maximum
number of features we desire in the combination.
– Example:
# Import Data
from data.datasets import breastcancer
from sklearn.neighbors import KNeighborsClassifier
df = breastcancer() # Load the breast cancer dataset
df = df.drop(["id"], axis=1) # Drop the 'id' column
# Run ML Pipeline
ml_pipeline_function(
df,
output_folder='./Outputs/', # Store the output in the './Outputs/' folder
missing_method='row_removal', # Remove rows with missing values
test_size=0.2, # Set the test set size to 20% of the dataset
categorical=['label_encoding'], # Apply label encoding for categorical features
features_label=['Target'], # Apply label encoding for the 'Target' feature
rescaling='standard_scaler', # Apply standard scaling to the features
feature_selection='backward_elimination', # Apply backward elimination for feature selection
wrapper_classifier=KNeighborsClassifier(), # Use K-nearest neighbors classifier for the wrapper method in backward elimination
k_features=2 # Select the top 2 features
)
# Execute the ML pipeline with the loaded dataset, removing rows with missing values,
# using a test set of 20%, applying label encoding for the 'Target' feature, rescaling
# the features using the standard scaler,
# and performing feature selection using backward elimination with a K-nearest
# neighbors classifier.
◦ Embedded methods:
▪ feature_selection: We can select several methods.
– lasso: If we choose lasso, we need to add the alpha parameter (lasso_alpha).
– feat_reg_ml: Allows selection of features with regularization embedded into machine learning algorithms. We need to
select the machine learning algorithms (in scikit-learn) by setting the parameter ml_penalty:
◦ embedded_linear_regression
◦ embedded_logistic_regression
◦ embedded_decision_tree_regressor
◦ embedded_decision_tree_classifier
◦ embedded_random_forest_regressor
◦ embedded_random_forest_classifier
◦ embedded_permutation_regression
◦ embedded_permutation_classification
◦ embedded_xgboost_regression
◦ embedded_xgboost_classification
– Example:
# Import Data
from data.datasets import breastcancer
df = breastcancer() # Load the breast cancer dataset
df = df.drop(["id"], axis=1) # Drop the 'id' column
# Run ML Pipeline
ml_pipeline_function(
df,
output_folder='./Outputs/', # Store the output in the './Outputs/' folder
missing_method='row_removal', # Remove rows with missing values
test_size=0.2, # Set the test set size to 20% of the dataset
categorical=['label_encoding'], # Apply label encoding for categorical features
features_label=['Target'], # Apply label encoding for the 'Target' feature
rescaling='standard_scaler', # Apply standard scaling to the features
feature_selection='lasso', # Apply lasso-based embedded feature selection
lasso_alpha=0.05 # Lasso regularization strength (illustrative value)
)
• Classification algorithms
– Classification_algorithms: The following classification algorithms are used only with CPUs:
◦ svm_linear
◦ svm_rbf
◦ svm_sigmoid
◦ svm_poly
◦ logistic_regression
◦ lda
◦ qda
◦ gnb
◦ mnb
◦ k-neighbors
▪ For k-neighbors, we need to add an additional parameter that indicates the number of neighbors (n_neighbors).
◦ sgd
◦ nearest_centroid
◦ decision_tree
◦ random_forest
▪ For random_forest, we can optionally add the number of estimators (n_estimators_forest).
◦ extra_trees
▪ For extra_trees, we add the number of estimators (n_estimators_forest).
◦ mlp_neural_network
▪ The following parameters are available: max_iter, hidden_layer_sizes, activation, solver, alpha, learning_rate,
learning_rate_init.
▪ max_iter: The maximum number of iterations (default = 200).
▪ hidden_layer_sizes: The ith element represents the number of neurons in the ith hidden layer.
▪ mlp_activation: The activation function for the hidden layer (“identity,” “logistic,” “relu,” “softmax,” “tanh”). The
default is “relu.”
▪ solver: The solver for weight optimization (“lbfgs,” “sgd,” “adam”). The default is “adam.”
▪ alpha: The strength of the L2 regularization term (default = 0.0001).
▪ mlp_learning_rate: The learning rate schedule for weight updates (“constant,” “invscaling,” “adaptive”). The
default is “constant.”
▪ learning_rate_init: The initial learning rate used (for sgd or adam). It controls the step size in updating the weights.
◦ mlp_neural_network_auto
– For each classification algorithm, we also need to add the number of k-folds for cross-validation (cv).
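As an illustration of the mlp_neural_network parameters listed above, a sketch of a pipeline call might look as follows. The hyperparameter values are purely illustrative, and ml_pipeline_function is assumed to be imported and used as in the other examples of this section.
# Illustrative sketch: training the MLP classifier with explicit hyperparameters
from data.datasets import breastcancer

df = breastcancer() # Load the breast cancer dataset
df = df.drop(["id"], axis=1) # Drop the 'id' column

ml_pipeline_function(
    df,
    output_folder='./Outputs/', # Store the output in the './Outputs/' folder
    missing_method='row_removal', # Remove rows with missing values
    test_size=0.2, # Set the test set size to 20% of the dataset
    categorical=['label_encoding'], # Apply label encoding for categorical features
    features_label=['Target'], # Apply label encoding for the 'Target' feature
    rescaling='standard_scaler', # Apply standard scaling to the features
    classification_algorithms=['mlp_neural_network'], # Multi-layer perceptron classifier
    max_iter=300, # Maximum number of iterations (illustrative value)
    hidden_layer_sizes=(100, 50), # Two hidden layers with 100 and 50 neurons (illustrative)
    mlp_activation='relu', # Activation function for the hidden layers
    solver='adam', # Solver for weight optimization
    alpha=0.0001, # Strength of the L2 regularization term
    mlp_learning_rate='constant', # Learning rate schedule for weight updates
    learning_rate_init=0.001, # Initial learning rate (illustrative value)
    cv=5 # Perform 5-fold cross-validation
)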
– Example:
# Import Data
from data.datasets import breastcancer
df = breastcancer() # Load the breast cancer dataset
df = df.drop(["id"], axis=1) # Drop the 'id' column
# Run ML Pipeline
ml_pipeline_function(
df,
output_folder='./Outputs/', # Store the output in the './Outputs/' folder
missing_method='row_removal', # Remove rows with missing values
test_size=0.2, # Set the test set size to 20% of the dataset
categorical=['label_encoding'], # Apply label encoding for categorical features
features_label=['Target'], # Apply label encoding for the 'Target' feature
rescaling='standard_scaler', # Apply standard scaling to the features
classification_algorithms=[
'svm_rbf', # Apply Support Vector Machine with radial basis function kernel
'lda', # Apply Linear Discriminant Analysis
'random_forest' # Apply Random Forest Classifier
],
n_estimators_forest=100, # Set the number of trees in the random forest to 100
cv=5 # Perform 5-fold cross-validation
)
# Execute the ML pipeline with the loaded dataset, removing rows with missing values,
# using a test set of 20%, applying label encoding for the 'Target' feature, rescaling
# the features using the standard scaler,
# and performing classification using SVM with RBF kernel, LDA, and Random Forest with
# the specified parameters.
The code above will print the steps of the process and provide the metrics of our models. In the next example, we add a GPU-based logistic regression (gpu_logistic_regression) to the list of classification algorithms:
# Import Data
from data.datasets import breastcancer
df = breastcancer() # Load the breast cancer dataset
df = df.drop(["id"], axis=1) # Drop the 'id' column
# Run ML Pipeline
ml_pipeline_function(
df,
output_folder='./Outputs/', # Store the output in the './Outputs/' folder
missing_method='row_removal', # Remove rows with missing values
test_size=0.2, # Set the test set size to 20% of the dataset
categorical=['label_encoding'], # Apply label encoding for categorical features
features_label=['Target'], # Apply label encoding for the 'Target' feature
rescaling='standard_scaler', # Apply standard scaling to the features
classification_algorithms=[
'svm_rbf', # Apply Support Vector Machine with radial basis function kernel
'lda', # Apply Linear Discriminant Analysis
'random_forest', # Apply Random Forest Classifier
'gpu_logistic_regression' # Apply GPU-accelerated Logistic Regression
],
n_estimators_forest=100, # Set the number of trees in the random forest to 100
gpu_logistic_activation='adam', # Set the activation function for GPU Logistic Regression to 'adam'
gpu_logistic_optimizer='adam', # Set the optimizer for GPU Logistic Regression to 'adam'
gpu_logistic_epochs=50, # Set the number of training epochs for GPU Logistic Regression to 50
cv=5 # Perform 5-fold cross-validation
)
# Execute the ML pipeline with the loaded dataset, removing rows with missing values,
# using a test set of 20%, applying label encoding for the 'Target' feature, rescaling
# the features using the standard scaler,
# and performing classification using SVM with RBF kernel, LDA, Random Forest, and
# GPU-accelerated Logistic Regression with the specified parameters.
The code above will print the steps of the process and provide the metrics of our models. We can also set the loss function for the GPU logistic regression with gpu_logistic_loss, as in the following example:
# Import Data
from data.datasets import breastcancer
df = breastcancer() # Load the breast cancer dataset
df = df.drop(["id"], axis=1) # Drop the 'id' column
# Run ML Pipeline
ml_pipeline_function(
df,
output_folder='./Outputs/', # Store the output in the './Outputs/' folder
missing_method='row_removal', # Remove rows with missing values
test_size=0.2, # Set the test set size to 20% of the dataset
categorical=['label_encoding'], # Apply label encoding for categorical features
features_label=['Target'], # Apply label encoding for the 'Target' feature
rescaling='standard_scaler', # Apply standard scaling to the features
classification_algorithms=[
'svm_rbf', # Apply Support Vector Machine with radial basis function kernel
'lda', # Apply Linear Discriminant Analysis
'random_forest', # Apply Random Forest Classifier
'gpu_logistic_regression' # Apply GPU-accelerated Logistic Regression
],
n_estimators_forest=100, # Set the number of trees in the random forest to 100
gpu_logistic_optimizer='adam', # Set the optimizer for GPU Logistic Regression to 'adam'
gpu_logistic_epochs=50, # Set the number of training epochs for GPU Logistic Regression to 50
gpu_logistic_loss='mse', # Set the loss function: mean squared error ('mse'), binary logarithmic loss ('binary_crossentropy'), or multi-class logarithmic loss ('categorical_crossentropy')
cv=5 # Perform 5-fold cross-validation
)
# Execute the ML pipeline with the loaded dataset, removing rows with missing values,
# using a test set of 20%, applying label encoding for the 'Target' feature, rescaling
# the features using the standard scaler,
# and performing classification using SVM with RBF kernel, LDA, Random Forest, and
# GPU-accelerated Logistic Regression with the specified parameters.
# Create a rotational layer to train. We will rotate each qubit the same amount.
# (Requires Qiskit; feature_dimension is assumed to be the number of features, i.e., the number of qubits.)
from qiskit import QuantumCircuit
from qiskit.circuit import ParameterVector
training_params = ParameterVector("θ", 1)
fm0 = QuantumCircuit(feature_dimension)
for qubit in range(feature_dimension):
    fm0.ry(training_params[0], qubit)
– quantum_backend: The backend on which to run the quantum algorithms, for example:
◦ ibmq_quito
◦ simulator_statevector
◦ simulator_extended_stabilizer
◦ simulator_stabilizer
◦ ibmq_manila
– multi-class: We can use “OneVsRestClassifier,” “OneVsOneClassifier,” and “svc” if we want to pass our quantum ker-
nel to SVC from scikit-learn or “None” if we wish to use QSVC from Qiskit.
– For Pegasos algorithms:
◦ n_steps = the number of steps performed during the training procedure.
◦ C = the regularization parameter.
– Example:
# Import Data
from data.datasets import breastcancer
df = breastcancer() # Load the breast cancer dataset
df = df.drop(["id"], axis=1) # Drop the 'id' column
# Run ML Pipeline
ml_pipeline_function(
df,
output_folder='./Outputs/', # Store the output in the './Outputs/' folder
missing_method='row_removal', # Remove rows with missing values
test_size=0.2, # Set the test set size to 20% of the dataset
categorical=['label_encoding'], # Apply label encoding for categorical features
features_label=['Target'], # Apply label encoding for the 'Target' feature
rescaling='standard_scaler', # Apply standard scaling to the features
features_extraction='pca', # Apply Principal Component Analysis for feature extraction
classification_algorithms=['svm_linear'], # Apply Support Vector Machine with a linear kernel
number_components=2, # Set the number of principal components to 2
cv=5, # Perform 5-fold cross-validation
quantum_algorithms=[
'q_kernel_default',
'q_kernel_zz',
'q_kernel_8',
'q_kernel_9',
'q_kernel_10',
'q_kernel_11',
'q_kernel_12'
], # List of quantum algorithms to use
reps=2, # Set the number of repetitions for the quantum circuits
ibm_account=YOUR_API, # Replace with your IBM Quantum API key
quantum_backend='qasm_simulator' # Use the QASM simulator as the quantum backend
)
# Execute the ML pipeline with the loaded dataset, removing rows with missing values,
# using a test set of 20%, applying label encoding for the 'Target' feature, rescaling
# the features using the standard scaler,
# and performing classification using SVM with a linear kernel, PCA for feature
# extraction, and quantum algorithms with the specified parameters.
We can also choose "least_busy" as a quantum_backend option in order to execute the algorithms on the chip that has the lowest number of jobs in the queue:
quantum_backend = 'least_busy'
• Regression algorithms:
– Regression algorithms used only with CPUs:
◦ linear_regression
◦ svr_linear
◦ svr_rbf
◦ svr_sigmoid
◦ svr_poly
◦ mlp_regression
◦ mlp_auto_regression
– Regression algorithms that use GPUs, if available:
◦ gpu_linear_regression: Linear regression using the SGD optimizer. As for classification, we need to add some
parameters:
▪ gpu_linear_activation: “Linear.”
▪ gpu_linear_epochs: An integer to define the number of epochs.
▪ gpu_linear_learning_rate: The learning rate for the SGD optimizer.
▪ gpu_linear_loss: The loss functions such as the mean squared error (“mse”), the binary logarithmic loss
(“binary_crossentropy”), or the multi-class logarithmic loss (“categorical_crossentropy”).
◦ gpu_mlp_regression: Multi-layer perceptron neural network using GPUs for regression, with the following
parameters to set:
▪ gpu_mlp_epochs_r: The number of epochs with an integer.
▪ gpu_mlp_activation_r: The activation function such as softmax, sigmoid, linear, or tanh.
▪ The chosen optimizer is “adam.” Note that no activation function is used for the output layer because it is a regression. We use mean_squared_error for the loss function (see the short Keras sketch after this list).
◦ gpu_rnn_regression: Recurrent neural network for regression. We need to set the following parameters:
▪ rnn_units: The dimensionality of the output space (positive integer).
▪ rnn_activation: The activation function to use (softmax, sigmoid, linear, or tanh).
▪ rnn_optimizer: The optimizer (adam, sgd, or RMSprop).
▪ rnn_loss: The loss function such as the mean squared error (“mse”), the binary logarithmic loss (“binary_
crossentropy”), or the multi-class logarithmic loss (“categorical_crossentropy”).
▪ rnn_epochs: The number of epochs (integer).
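To make the output-layer remark above concrete, here is a minimal Keras sketch of an MLP regressor. It illustrates the general pattern only and is not hephAIstos's internal code; the layer sizes, data, and epoch count are arbitrary. The hidden layers use an activation function, the output layer does not, and the model is compiled with the Adam optimizer and a mean squared error loss.
# Minimal sketch (illustrative only, not hephAIstos's implementation)
import numpy as np
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(64, activation='relu', input_shape=(10,)),  # hidden layer with an activation
    keras.layers.Dense(1)  # output layer: no activation function, since this is a regression
])
model.compile(optimizer='adam', loss='mean_squared_error')  # Adam optimizer, MSE loss

# Train on random data just to illustrate the call
X = np.random.rand(100, 10)
y = np.random.rand(100)
model.fit(X, y, epochs=2, verbose=0)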
– Example:
# Drop the 'id' column from the dataset as it is not relevant for analysis
df = df.drop(["id"], axis=1)
# Call the machine learning pipeline function with the following parameters:
# - DataFrame (df)
# - Output folder for saving results ('./Outputs/')
As with the previous examples, the pipeline call will print the steps of the process and provide the metrics of our models. As another example, we can apply a GPU-based recurrent neural network regression (gpu_rnn_regression) to a time series dataset:
# Load the DailyDelhiClimateTrain dataset (through the bundled dataset loader, as in the
# earlier example) and make sure the 'date' column has a datetime type
import pandas as pd
from data.datasets import DailyDelhiClimateTrain
df = DailyDelhiClimateTrain()
df['date'] = pd.to_datetime(df['date'])
# Extract the year, month, and day from the 'date' column and create new columns for each
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
# Run the Machine Learning (ML) pipeline function with the following parameters:
# - Dataframe: df
# - Output folder: './Outputs/'
# - Missing data handling method: 'row_removal'
# - Test dataset size: 20% of the total dataset
# - Data rescaling method: 'standard_scaler'
# - Regression algorithm to be used: 'gpu_rnn_regression'
# - GPU RNN Regression specific parameters:
# - Number of units: 500
# - Activation function: 'tanh'
# - Optimizer: 'RMSprop'
# - Loss function: 'mse'
# - Number of training epochs: 50
ml_pipeline_function(
df,
output_folder='./Outputs/',
missing_method='row_removal',
test_size=0.2,
rescaling='standard_scaler',
regression_algorithms=['gpu_rnn_regression'],
rnn_units=500,
rnn_activation='tanh',
rnn_optimizer='RMSprop',
rnn_loss='mse',
rnn_epochs=50,
)
# Load the MNIST dataset into a tuple of tuples (train and test data)
from tensorflow.keras.datasets import mnist
df = mnist.load_data()
# Separate the dataset into features (X) and labels (y) for both train and test data
(X, y), (_, _) = mnist.load_data()
# Assign the train and test data to variables
(X_train, y_train), (X_test, y_test) = df
# Reshape the training data to fit the model's input shape (number of images, height, width, and channels)
X_train = X_train.reshape(X_train.shape[0], X_train.shape[1], X_train.shape[2], 1)
# Reshape the testing data to fit the model's input shape
X_test = X_test.reshape(X_test.shape[0], X_test.shape[1], X_test.shape[2], 1)
# Reshape the whole dataset to fit the model's input shape
X = X.reshape(X.shape[0], X.shape[1], X.shape[2], 1)
# Call the ml_pipeline_function with the given parameters to train and evaluate a
# convolutional neural network
ml_pipeline_function(df, X, y, X_train, y_train, X_test, y_test,
    output_folder='./Outputs/', convolutional=['conv2d'], conv_activation='relu',
    conv_kernel_size=3, conv_optimizer='adam', conv_loss='categorical_crossentropy',
    conv_epochs=1)
Almost all the datasets used in this book can be found at the following link:
https://github.com/xaviervasques/hephaistos/tree/main/data/datasets.
Another way to use the datasets is to download or clone hephAIstos from GitHub (https://github.com/xaviervasques/hephaistos.git) and work within the hephAIstos folder: the data are located in data/datasets, and all code examples are in the Notebooks folder.
If you open a terminal, move to hephAIstos/Notebooks, and type jupyter notebook, your browser will open and display all the Jupyter Notebooks available for this book. In each notebook, the path to the datasets is already defined in the code examples.
Further Reading
Chicco, D. and Jurman, G. (2020). The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in
binary classification evaluation. BMC Genomics 21 (6): 6. https://doi.org/10.1186/s12864-019-6413-7. PMC 6941312. PMID
31898477.
Derczynski, L. (2016). Complementarity, F-score, and NLP evaluation. In: Proceedings of the International Conference on Language
Resources and Evaluation.
Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recognition Letters 27 (8): 861–874. https://doi.org/10.1016/
j.patrec.2005.10.010.
Goodfellow, I., Bengio, Y., and Courville, A. Deep Learning Book. MIT Press. http://www.deeplearningbook.org.
Hand, D. and Christen, P. (2018). A note on using the F-measure for evaluating record linkage algorithms. Statistics and
Computing 28: 539–547. https://doi.org/10.1007/s11222-017-9746-6.
Madeh, P.S. and El-Diraby Tamer, E. (2020). Data analytics in asset management: cost-effective prediction of the pavement
condition index. Journal of Infrastructure Systems 26 (1): 04019036. https://doi.org/10.1061/(ASCE)IS.1943-555X.0000512.
Pedregosa, F., Varoquaux, G., Gramfort, A. et al. (2011). Scikit-learn: machine learning in Python. Journal of Machine Learning
Research 12 (85): 2825–2830.
Powers, D.M.W. (2011). Evaluation: from precision, recall and F-measure to ROC, informedness, markedness & correlation.
Journal of Machine Learning Technologies 2 (1): 37–63.
Christen, P., Hand, D.J., and Kirielle, N. (2023). A review of the F-measure: its history, properties, criticism, and alternatives. ACM
Computing Surveys. https://doi.org/10.1145/3606367.
Siblini, W., Fréry, J., He-Guelton, L. et al. (2020). Master your metrics with calibration. In: Advances in Intelligent Data Analysis
XVIII (ed. M. Berthold, A. Feelders, and G. Krempl), 457–469. Springer. https://doi.org/10.1007/978-3-030-44584-3_36.
Taha, A.A. and Hanbury, A. (2015). Metrics for evaluating 3D medical image segmentation: analysis, selection, and tool. BMC
Medical Imaging 15 (29): 1–28. https://doi.org/10.1186/s12880-015-0068-x.
Van Rijsbergen, C.J. (1979). Information Retrieval, 2e. Butterworth-Heinemann.
Williams, C.K.I. (2021). The effect of class imbalance on precision-recall curves. Neural Computation 33 (4): 853–857.
arXiv:2007.01905. https://doi.org/10.1162/neco_a_01362. PMID 33513323.
https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&=Ubuntu&target_version=20.04&target_type=deb_local
https://developer.nvidia.com/cudnn
https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html
https://dorianbrown.dev/what-is-supervised-learning/
https://github.com/tensorflow/models/tree/master/official
https://machinelearningmastery.com/k-fold-cross-validation/
https://medium.com/geekculture/installing-cudnn-and-cuda-toolkit-on-ubuntu-20-04-for-machine-learning-tasks-f41985fcf9b2
https://www.analyticsvidhya.com/blog/2019/08/detailed-guide-7-loss-functions-machine-learning-python-code/
https://www.hindawi.com/journals/cmmm/2021/8500314/
https://www.ibm.com/cloud/learn/underfitting
https://www.toolbox.com/tech/artificial-intelligence/articles/top-python-libraries-for-machine-learning/
2 Feature Engineering Techniques in Machine Learning
Datasets are samples that are composed of several observations. Each observation is a set of values associated with a set of
variables. The transformations we need to perform before starting to train models depend on the nature of features or vari-
ables. In the variable’s realm, we can find quantitative and qualitative variables. Quantitative variables can be continuous,
which means composed of real values, or discrete, which means that they can only take values from within a finite set or an
infinite but countable set. Qualitative variables do not translate mathematical magnitudes and can be ordinal, for which an
order relationship can be defined, or nominal, where no order relationship can be found between values.
Feature engineering is an important part of the data science workflow that can greatly impact the performance of machine
learning algorithms. Anomalies must be corrected or at least not ignored, and we need to adjust for missing values, eliminate
duplicate observations, digitalize the data to facilitate the use of machine learning tools, encode categorical data, rescale
data, and perform other tasks. For instance, we can often replace a missing value by the mean, but when the number of missing values is large, this can create bias. Instead, we can use linear regression imputation, which is more complicated but more helpful because the new values are not chosen at random. The detection of outliers is also a good exercise: extreme values and errors that the models would otherwise treat as normal data need to be detected using techniques such as boxplots. A descriptive analysis of the data is something to consider before going forward in machine learning workflows.
We will make some decisions in order to have a dataset that is representative of the real world. In the real world, datasets are
not expressed in the same scale, forcing us to perform data standardization, normalization, or scaling to produce comparable
scales across our datasets.
Data preprocessing is a crucial step in machine and deep learning. It is indeed integral to machine learning, as the quality
of data and the useful information that can be derived from it directly affects the ability of a model to learn; therefore, it is
extremely important that we preprocess our data before feeding it into our model. It is necessary to perform a certain num-
ber of operations to obtain data of quality, as we will see in this chapter.
2.1 Feature Rescaling: Structured Continuous Numeric Data
Most of the time, we will encounter different types of variables with different ranges that can differ considerably in the same
dataset. If we use the data with the original scale, we will certainly put more weight on the variables with a large range.
Therefore, we need to apply what is called feature rescaling to ensure that the variables are almost on the same scale, which
allows consideration of features as equally important (apples to apples).
Often, we refer to the concepts of standardization, normalization, or scaling. Scaling indicates that we change the range
of the values without changing the shape of the distribution. The range is often set at 0 to 1 or −1 to 1. Standardization refers
to converting the data distribution into a normal form and transforming the mean of the data to 0 and its variance to 1.
Normalization refers to transforming the data into the range [0, 1] or dividing a vector by a norm. Normalization can reflect
not only this definition but also others, creating confusion because it has many definitions in statistical learning. Perhaps the
best choice is to always define the word we use before any communication.
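As a toy illustration of these three terms (the numbers are arbitrary), the following snippet scales, standardizes, and normalizes the same small column of values with scikit-learn:
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, Normalizer

x = np.array([[1.0], [5.0], [10.0]])  # one feature, three observations

print(MinMaxScaler().fit_transform(x).ravel())    # scaling: the range is mapped to [0, 1]
print(StandardScaler().fit_transform(x).ravel())  # standardization: zero mean, unit variance
# Normalization in the "divide by a norm" sense works row-wise,
# so here the three values are treated as a single sample (one row):
print(Normalizer(norm='l2').fit_transform(x.reshape(1, -1)).ravel())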
Many machine learning algorithms perform better or converge faster when features are close to normally distributed
and on a relatively similar scale. It is important to transform data for each feature to have a variance equivalent to
others in orders of magnitude to avoid dominance of an objective function (such as the radial basis function kernel of
support vector machines) that would make the estimator unable to learn correctly. Data scaling is required in
algorithms such as those based on gradient descent (linear regression, neural networks) and distance-based algorithms
(k-nearest neighbors, k-means, SVM). For example, it is meaningless if we do not standardize our data before
measuring the importance of variables in regression models or before lasso and ridge regressions. The reason we stand-
ardize data is not the same for all machine learning algorithms. For example, in distance-based models, we perform
standardization to prevent features with wider ranges from dominating the distance metric. Some algorithms,
including tree-based algorithms such as decision tree, random forest, and gradient boosting, are not sensitive to
the magnitude of variables.
We can implement data preprocessing for machine learning using many available methods, such as doing it yourself
or using NumPy, SciPy, or those proposed in scikit-learn (MinMaxScaler, RobustScaler, StandardScaler, MaxAbsScaler,
Normalizer). If you download hephAIstos from GitHub (https://github.com/xaviervasques/hephaistos.git), you will find
all code examples in hephaistos/Notebooks/Features_rescaling.ipynb.
Before going into the details, let us import some libraries:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use("seaborn")
from scipy.stats import skew
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
from sklearn import preprocessing
import os
Nature does not always produce perfectly normal distributions. We will apply the transformations described
above with a dataset coming from 46 MRIs (17 healthy controls, 14 patients with disease type 1, 15 patients with disease
type 2). Cortical and subcortical features were extracted using Freesurfer, version 7; in total, 474 features were extracted
from 3D MRI T1-weighted images. (Freesurfer is a set of tools for analysis and visualization of cortical and subcortical
brain imaging data. It is designed around an automated workflow that includes several standard image processing
phases.)
To start, we will extract from the 474 features only the “TotalGrayVol” data, which corresponds to the total gray volume of
each brain analyzed.
Let us load the data from a .csv file and extract the “TotalGrayVol” feature.
Input:
# Load data
csv_data = '../data/datasets/brain_train.csv'
data = pd.read_csv(csv_data, delimiter=';')
df = data[["Target","TotalGrayVol"]]
y = df.loc[:, df.columns == 'Target'].values.ravel()
X = df.loc[:, df.columns != 'Target']
print(df.head())
Output:
Target TotalGrayVol
0 1 684516.128934
1 1 615126.611828
2 1 678687.178551
3 1 638615.189584
4 1 627356.125850
2.1.1.1 StandardScaler
A standardized value (z-score) of a feature x is computed as follows:

z = \frac{x - \mathrm{mean}(x)}{\mathrm{standard\ deviation}(x)}
Many machine learning estimators will behave badly if the individual features do not appear as standard, normally dis-
tributed data (Gaussian with zero mean and unit variance). Most of the time, data scientists do not look at the shape of the
distribution. They simply center data by removing the mean value of each feature and scale it by dividing non-constant
values by their standard deviation. The StandardScaler technique assumes that data are normally distributed
(Figures 2.1 and 2.2). Essentially, the idea of dataset standardization is to have equal range and variance from a rescaled
original variable.
Let us imagine that the data are not normally distributed and that 50% of the data points have a value of 0 and the remain-
ing 50% have a value of 1. Gaussian scaling will send half of the data to −1 and the other half to +1; the data are moved away
from 0. Depending on the shape of our data, we need to consider other kinds of scaling that will produce better results.
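A quick sketch of that half-zeros, half-ones case (the values are illustrative):
import numpy as np
from sklearn.preprocessing import StandardScaler

x = np.array([[0.0], [0.0], [1.0], [1.0]])  # 50% of the points at 0, 50% at 1
print(StandardScaler().fit_transform(x).ravel())
# [-1. -1.  1.  1.] : the two halves are sent to -1 and +1, away from 0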
Figure 2.1 In standardization, features are rescaled to ensure that the mean and the standard deviation are 0 and 1, respectively.
Figure 2.2 The result of this transformation is that we have shifted the means to 0; most of the data (68%) would thus be between –1 and 1.
StandardScaler = preprocessing.StandardScaler()
stdscaler_transformed = StandardScaler.fit_transform(X)
plt.rcParams["figure.figsize"] = 13,5
fig,ax = plt.subplots(1,2)
sns.distplot(X, label= "Orginal Skew :{0}".format(np.round(skew(X),4)),
color="magenta", ax=ax[0], axlabel="ORGINAL")
sns.distplot(stdscaler_transformed, label= "Transformed Skew:{0}".format(np.round
(skew(stdscaler_transformed),4)), color="g", ax=ax[1], axlabel="StandardScaler")
fig.legend()
plt.show()
Output:
Original Skew: [–0.5994]
Transformed Skew: [–0.5994]
(Output plot: density of the original feature on the left and of the StandardScaler-transformed feature on the right.)
We can see that the shape of the distribution has not changed. The top of the curve is centered at 0, as expected.
2.1.1.2 MinMaxScaler
We can also transform features by scaling each of them to a defined range (e.g., between −1 and 1 or between 0 and 1).
Min-max scaling (MinMaxScaler), for instance, can be very useful for some machine learning models. MinMaxScaler
has some advantages over StandardScaler when the data distribution is not Gaussian and the feature falls within a bounded
interval, which is typically the case with pixel intensities that fit within a range of 0–255.
MinMaxScaler is calculated as follows:
z = \frac{x_i - \min(x)}{\max(x) - \min(x)}
We can code MinMaxScaler as follows:
Input:
min_max_scaler = preprocessing.MinMaxScaler()
minmax_transformed = min_max_scaler.fit_transform(X)
plt.rcParams["figure.figsize"] = 13,5
fig,ax = plt.subplots(1,2)
sns.distplot(X, label= "Orginal Skew :{0}".format(np.round(skew(X),4)),
color="magenta", ax=ax[0], axlabel="ORGINAL")
sns.distplot(minmax_transformed, label= "Transformed Skew:{0}".format(np.round(skew
(minmax_transformed),4)), color="g", ax=ax[1], axlabel="MinMaxScaler")
fig.legend()
plt.show()
Output:
Original Skew: [–0.5994]
Transformed Skew: [–0.5994]
(Output plot: density of the original feature on the left and of the MinMaxScaler-transformed feature on the right.)
By default, the feature range is between 0 and 1. We can modify this range by adding the option feature_range:
preprocessing.MinMaxScaler(feature_range=(0, 1))
2.1.1.3 MaxAbsScaler
The MaxAbsScaler is similar to the MinMaxScaler, with the difference that it scales each feature by its maximum absolute value, so that the values lie within the range [−1, 1]. This scaler is specifically suited to data that are already centered at zero or to sparse data; it does not center the data and therefore maintains sparsity:

z = \frac{x_i}{\max(|x|)}
We can code MaxAbsScaler as follows:
Input:
max_abs_scaler = preprocessing.MaxAbsScaler()
maxabs_transformed = max_abs_scaler.fit_transform(X)
plt.rcParams["figure.figsize"] = 13,5
fig,ax = plt.subplots(1,2)
sns.distplot(X, label= "Orginal Skew :{0}".format(np.round(skew(X),4)),
color="magenta", ax=ax[0], axlabel="ORGINAL")
sns.distplot(maxabs_transformed, label= "Transformed Skew:{0}".format(np.round(skew
(maxabs_transformed),4)), color="g", ax=ax[1], axlabel="MaxAbsScaler")
fig.legend()
plt.show()
(Output plot: density of the original feature on the left and of the MaxAbsScaler-transformed feature on the right.)
2.1.1.4 RobustScaler
If the data contain a considerable number of outliers, the use of the mean and variance to scale the data will probably not
work correctly. In this case, an option is to use RobustScaler, which removes the median and scales the data according to
the quantile range:
z = \frac{x_i - Q_1(x)}{Q_3(x) - Q_1(x)}
As we can see (Figure 2.3), the scaler uses the interquartile range (IQR), which is the range between the first quartile
Q1(x) and the third quartile Q3(x).
We will see how feature scaling can significantly improve the performance of some machine learning algorithms but not
improve it at all for others.
Figure 2.3 The scaler uses the interquartile range (IQR), which is the range between the first quartile Q1(x) and the third quartile Q3(x).
RobustScaler = preprocessing.RobustScaler()
robust_transformed = RobustScaler.fit_transform(X)
plt.rcParams["figure.figsize"] = 13,5
fig,ax = plt.subplots(1,2)
sns.distplot(X, label= "Orginal Skew :{0}".format(np.round(skew(X),4)),
color="magenta", ax=ax[0], axlabel="ORGINAL")
sns.distplot(robust_transformed, label= "Transformed Skew:{0}".format(np.round(skew
(robust_transformed),4)), color="g", ax=ax[1], axlabel="RobustScaler")
fig.legend()
plt.show()
(Output plot: density of the original feature on the left and of the RobustScaler-transformed feature on the right.)
• The norm of a vector is always positive, ‖a‖ ≥ 0, and is zero if and only if the vector is the zero vector (a = 0).
• A scalar multiple to a norm is equal to the product of the absolute value of the scalar and the norm: ‖ka‖ = |k| ‖a‖.
• The norm of a vector obeys the triangular inequality: the norm of the sum of two vectors is less than or equal to the sum of the norms of these vectors, ‖a + b‖ ≤ ‖a‖ + ‖b‖.
The length of a vector can be calculated using the l1 norm as the sum of the absolute values of the vector elements: \|a\|_1 = |a_1| + |a_2| + \ldots + |a_n|. The l2 norm, or Euclidean norm, is the most often used vector norm: \|a\|_2 = \sqrt{a_1^2 + a_2^2 + \ldots + a_n^2}. The p-norm, called the Minkowski norm, is defined as follows: \|a\|_p = \left(|a_1|^p + |a_2|^p + \ldots + |a_n|^p\right)^{1/p}. The max-norm, also called the Chebyshev norm, is the largest absolute element in the vector: \|a\|_\infty = \max(|a_1|, |a_2|, \ldots, |a_n|).
In machine learning, we usually use the l1 norm when the sparsity of the vector matters. The sparsity corresponds to
the property of having highly significant coefficients (either very near to or very far from zero) where the coefficient
very near zero could later be eliminated. This technique is an alternative when we are processing a large set of features.
One consideration when building a model is its ability to ignore extreme values in the data, in other words, its resist-
ance to outliers in a dataset. The l1 norm is more robust than the l2 norm from this point of view, as it only considers
the absolute values; thus, it treats them linearly, whereas the l2 norm increases the cost of outliers exponentially. In an
opposite sense, the resistance to horizontal adjustments is more stable with the l2 norm than with the l1 norm. The l1
norm is computationally more expensive than the l2 norm because we cannot solve it in terms of matrix operations.
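These norms can be computed directly with NumPy; a quick sketch with an arbitrary vector:
import numpy as np

a = np.array([1.0, -2.0, 2.0])

print(np.linalg.norm(a, 1))       # l1 norm: |1| + |-2| + |2| = 5.0
print(np.linalg.norm(a, 2))       # l2 (Euclidean) norm: sqrt(1 + 4 + 4) = 3.0
print(np.linalg.norm(a, np.inf))  # max (Chebyshev) norm: 2.0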
If x is the vector of covariates of length n, and we assume that the normalized vector is z = x/y, then three options denote what to use for y.
For the l1 option:

y = \|x\|_1 = \sum_{i=1}^{n} |x_i|

For the l2 option:

y = \|x\|_2 = \sqrt{\sum_{i=1}^{n} x_i^2}

For the max option:

y = \max_i |x_i|
As stated above, there are different types of normalization. Here, we consider unit vector normalization. Assume we have a dataset X with D columns (features) and N rows (entries). The unit vector normalization of the dataset is calculated as follows:

Z_{j,:} = \frac{X_{j,:}}{\|X_{j,:}\|}
Normalize = preprocessing.Normalizer()
norm_transformed = Normalize.fit_transform(X)
plt.rcParams["figure.figsize"] = 13,5
fig,ax = plt.subplots(1,2)
sns.distplot(X, label= "Orginal Skew :{0}".format(np.round(skew(X),4)),
color="magenta", ax=ax[0], axlabel="ORGINAL")
sns.distplot(norm_transformed, label= "Transformed Skew:{0}".format(np.round(skew
(norm_transformed),4)), color="g", ax=ax[1], axlabel="Normalizer")
fig.legend()
plt.show()
(Output plot: density of the original feature on the left and of the normalized feature on the right.)
In scikit-learn, the normalizer uses l2 by default. We can change this behavior by using the norm option
(“l1,” “l2,” “max”):
sklearn.preprocessing.Normalizer(norm='l2', *, copy=True)
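For instance, a small sketch of the three options applied to a single row vector (the values are arbitrary):
import numpy as np
from sklearn.preprocessing import Normalizer

v = np.array([[1.0, -2.0, 2.0]])

print(Normalizer(norm='l1').fit_transform(v))   # divide by the l1 norm (5): [[ 0.2 -0.4  0.4]]
print(Normalizer(norm='l2').fit_transform(v))   # divide by the l2 norm (3): approximately [[ 0.33 -0.67  0.67]]
print(Normalizer(norm='max').fit_transform(v))  # divide by the max abs value (2): [[ 0.5 -1.   1. ]]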
(Figure 2.4: lognormal cumulative distribution functions for σ = 0.25 and σ = 1, plotted against x.)
where ϕ is the cumulative distribution function of the normal distribution. The cumulative distribution function of the standard normal distribution is the following:

F(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-\frac{t^2}{2}}\, dt
The output shown in Figure 2.4 is the plot of lognormal cumulative distribution functions.
• Log transformation, in which each variable x is replaced by log(x) using a natural, base-10, or base-2 logarithm.
• Square root transformation, in which x is replaced by the square root of x. (This will produce a moderate effect but can be applied to zero values.)
• Yeo-Johnson transformation.
The log transformation can be coded using NumPy:
Input:
log_target = np.log1p(X)
plt.rcParams["figure.figsize"] = 13,5
fig,ax = plt.subplots(1,2)
sns.distplot(X, label= "Orginal Skew:{0}".format(np.round(skew(X),4)),
color="magenta", ax=ax[0], axlabel="ORGINAL")
sns.distplot(log_target, label= "Transformed Skew:{0}".format(np.round(skew
(log_target),4)), color="g", ax=ax[1], axlabel="LOG TRANSFORMED")
fig.legend()
plt.show()
(Output plot: density of the original feature on the left and of the log-transformed feature on the right.)
sqrrt_target = df['TotalGrayVol']**(1/2)
plt.rcParams["figure.figsize"] = 13,5
fig,ax = plt.subplots(1,2)
sns.distplot(df['TotalGrayVol'], label= "Orginal Skew:{0}".format(np.round(skew(df
['TotalGrayVol']),4)), color="magenta", ax=ax[0], axlabel="ORGINAL")
sns.distplot(sqrrt_target, label= "Transformed Skew:{0}".format(np.round(skew
(sqrrt_target),4)), color="g", ax=ax[1], axlabel="SQUARE ROOT TRANSFORMED")
fig.legend()
plt.show()
(Output plot: density of the original feature on the left and of the square-root-transformed feature on the right.)
re_target = 1/df['TotalGrayVol']
plt.rcParams["figure.figsize"] = 13,5
fig,ax = plt.subplots(1,2)
sns.distplot(df['TotalGrayVol'], label= "Orginal Skew :{0}".format(np.round(skew(df
['TotalGrayVol']),4)), color="magenta", ax=ax[0], axlabel="ORGINAL")
sns.distplot(re_target, label= "Transformed Skew:{0}".format(np.round(skew
(re_target),4)), color="g", ax=ax[1], axlabel="INVERSE TRANSFORMED")
fig.legend()
plt.show()
Output:
Original Skew: –0.5856
Transformed Skew: 1.7916
(Output plot: density of the original feature on the left and of the inverse-transformed feature on the right.)
For all x that are strictly positive, we define the Box-Cox transformation as follows:

B(x, \lambda) = \begin{cases} \dfrac{x^{\lambda} - 1}{\lambda} & \text{if } \lambda \neq 0 \\ \log(x) & \text{if } \lambda = 0 \end{cases}
where λ is a parameter that we choose. We can perform the Box-Cox transformation on both time-based and non-time series
data. We must choose a value for λ that provides the best approximation of the normal distribution of our feature. In Python,
SciPy has a boxcox function that chooses the optimal value of λ for us (scipy.stats.boxcox). It is also possible to specify an
alpha number that calculates the confidence interval for the optimal value of λ.
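A quick sketch of this SciPy call on synthetic, strictly positive data (illustrative only):
import numpy as np
from scipy.stats import boxcox

x = np.random.lognormal(mean=0.0, sigma=1.0, size=1000)  # skewed, strictly positive data

y, lam = boxcox(x)                    # the optimal lambda is chosen automatically
y2, lam2, ci = boxcox(x, alpha=0.05)  # also return a 95% confidence interval for lambda
print(lam, ci)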
Box-Cox transformation requires input data to be strictly positive. To code Box-Cox, we can use SciPy as follows:
Input:
plt.rcParams["figure.figsize"] = 13,5
fig,ax = plt.subplots(1,2)
2.1 Feature Rescaling: Structured Continuous Numeric Data 47
We can see in the output that the skew of the transformed data is very close to zero:
Original Skew: –0.5856
Transformed Skew: 0.0659
(Output plot: density of the original feature on the left and of the Box-Cox-transformed feature on the right.)
Like the Box-Cox transformation, the Yeo-Johnson is a nonlinear transformation that allows reducing the skewness and
obtaining a distribution closer to normal:
\psi(\lambda, y) = \begin{cases} \dfrac{(y + 1)^{\lambda} - 1}{\lambda} & \text{if } \lambda \neq 0,\; y \geq 0 \\ \log(y + 1) & \text{if } \lambda = 0,\; y \geq 0 \\ -\dfrac{(-y + 1)^{2 - \lambda} - 1}{2 - \lambda} & \text{if } \lambda \neq 2,\; y < 0 \\ -\log(-y + 1) & \text{if } \lambda = 2,\; y < 0 \end{cases}
Yeo-Johnson transformation supports both positive and negative data. As is the case for Box-Cox, we can use SciPy to code
Yeo-Johnson transformation.
Input:
plt.rcParams["figure.figsize"] = 13,5
fig,ax = plt.subplots(1,2)
48 2 Feature Engineering Techniques in Machine Learning
or
Output:
Original Skew: –0.5856
Transformed Skew: 0.0659
(Output plot: density of the original feature on the left and of the Yeo-Johnson-transformed feature on the right.)
Let X be a random variable following a normal distribution:

X \sim \mathcal{N}(\mu, \sigma^2)

Its quantile function is:

Q_X(p) = \mu + \sigma \sqrt{2}\, \operatorname{erf}^{-1}(2p - 1)

where erf⁻¹(x) is the inverse error function.
Alternatively, let X be a random variable following a continuous uniform distribution:

X \sim \mathcal{U}(a, b)

The quantile function of X is then the following:

Q_X(p) = \begin{cases} -\infty & \text{if } p = 0 \\ bp + a(1 - p) & \text{if } p > 0 \end{cases}
We can code quantile transformation with Gaussian distribution using scikit-learn, as follows:
Input:
Transformer = preprocessing.QuantileTransformer(output_distribution='normal') # define the transformer (Gaussian output distribution; illustrative, default settings otherwise)
quantile_transform = Transformer.fit_transform(X)
plt.rcParams["figure.figsize"] = 13,5
fig,ax = plt.subplots(1,2)
sns.distplot(X, label= "Orginal Skew :{0}".format(np.round(skew(X),4)),
color="magenta", ax=ax[0], axlabel="ORGINAL")
sns.distplot(quantile_transform, label= "Transformed Skew:{0}".format(np.round(skew
(quantile_transform),4)), color="g", ax=ax[1], axlabel="Quantile")
fig.legend()
plt.show()
Output:
(Output plot: density of the original feature on the left and of the quantile-transformed feature with Gaussian output distribution on the right.)
Transformer = preprocessing.QuantileTransformer(output_distribution='uniform') # define the transformer (uniform output distribution; illustrative, default settings otherwise)
quantile_transform = Transformer.fit_transform(X)
plt.rcParams["figure.figsize"] = 13,5
fig,ax = plt.subplots(1,2)
sns.distplot(X, label= "Orginal Skew :{0}".format(np.round(skew(X),4)),
color="magenta", ax=ax[0], axlabel="ORGINAL")
sns.distplot(quantile_transform, label= "Transformed Skew:{0}".format(np.round(skew
(quantile_transform),4)), color="g", ax=ax[1], axlabel="Quantile")
fig.legend()
plt.show()
Output:
[Figure: distribution plots of the original data and the quantile-transformed data with a uniform output distribution (panels "Original" and "Quantile").]
StandardScaler 0.7
MinMaxScaler 0.2
MaxAbsScaler 0.8
RobustScaler 0.9
Normalizer 0.8
Log Transformation 0.8
Yeo-Johnson 0.9
Quantile with Gaussian 0.5
Quantile with Uniform 0.8
If we wish to use hephAIstos to create a pipeline with the same data as above, we can go to the hephAIstos main folder and
create a Python file (for example, mypipeline.py). Let us start with an example in which we desire to rescale our brain dataset
using RobustScaler and apply an SVM with RBF kernel. As we did previously, we select the "Target" and "TotalGrayVol" fea-
tures and take 20% of the data for testing purposes. Because the 80/20 split is drawn randomly, the training and testing data are
not the same as previously. Relying only on a single accuracy figure is less informative than adding cross-validation; here, we
will also obtain results for a fivefold cross-validation (cv = 5).
Input:
# Import dataset
from data.datasets import brain_train
data = brain_train()
df = data[["Target","TotalGrayVol"]]
Output:
In the Outputs/ folder, we can find the model (svm_rbf.joblib), which can be used later, and a file named
classification_metrics.csv with the results of the pipeline.
To run a pipeline with several rescaling methods, we can use the code below.
Input:
# Import dataset
from data.datasets import brain_train
data = brain_train()
df = data[["Target","TotalGrayVol"]]
Output:
accuracy 0.60 10
macro avg 0.78 0.60 0.52 10
weighted avg 0.78 0.60 0.52 10
SVM_rbf
Rescaling Method StandardScaler
Missing Method None
Extraction Method None
Accuracy 0.6
Precision 0.6
Recall 0.6
F1 Score 0.6
Cross-validation mean 0.75
Cross-validation std 0.055328
accuracy 0.60 10
macro avg 0.78 0.60 0.52 10
weighted avg 0.78 0.60 0.52 10
SVM_rbf
Rescaling Method MinMaxScaler
Missing Method None
Extraction Method None
Accuracy 0.6
Precision 0.6
Recall 0.6
F1 Score 0.6
Cross-validation mean 0.75
Cross-validation std 0.055328
accuracy 0.60 10
macro avg 0.78 0.60 0.52 10
weighted avg 0.78 0.60 0.52 10
SVM_rbf
Rescaling Method MaxAbsScaler
Missing Method None
Extraction Method None
Accuracy 0.6
Precision 0.6
Recall 0.6
F1 Score 0.6
Cross-validation mean 0.75
Cross-validation std 0.055328
accuracy 0.60 10
macro avg 0.78 0.60 0.52 10
weighted avg 0.78 0.60 0.52 10
SVM_rbf
Rescaling Method RobustScaler
Missing Method None
Extraction Method None
Accuracy 0.6
Precision 0.6
Recall 0.6
F1 Score 0.6
Cross-validation mean 0.75
Cross-validation std 0.055328
accuracy 0.50 10
macro avg 0.25 0.50 0.33 10
weighted avg 0.25 0.50 0.33 10
SVM_rbf
Rescaling Method Normalizer
Missing Method None
Extraction Method None
Accuracy 0.5
Precision 0.5
Recall 0.5
F1 Score 0.5
Cross-validation mean 0.75
Cross-validation std 0.055328
accuracy 0.40 10
macro avg 0.22 0.40 0.29 10
weighted avg 0.22 0.40 0.29 10
SVM_rbf
Rescaling Method Log
Missing Method None
Extraction Method None
Accuracy 0.4
Precision 0.4
Recall 0.4
F1 Score 0.4
Cross-validation mean 0.778571
Cross-validation std 0.065465
accuracy 0.40 10
macro avg 0.22 0.40 0.29 10
weighted avg 0.22 0.40 0.29 10
SVM_rbf
Rescaling Method SquareRoot
Missing Method None
Extraction Method None
Accuracy 0.4
Precision 0.4
Recall 0.4
F1 Score 0.4
Cross-validation mean 0.75
Cross-validation std 0.055328
accuracy 0.60 10
macro avg 0.78 0.60 0.52 10
weighted avg 0.78 0.60 0.52 10
SVM_rbf
Rescaling Method Box-Cox
Missing Method None
Extraction Method None
Accuracy 0.6
Precision 0.6
Recall 0.6
F1 Score 0.6
Cross-validation mean 0.75
Cross-validation std 0.055328
accuracy 0.60 10
macro avg 0.78 0.60 0.52 10
weighted avg 0.78 0.60 0.52 10
SVM_rbf
Rescaling Method Yeo-Johnson
Missing Method None
Extraction Method None
Accuracy 0.6
Precision 0.6
Recall 0.6
F1 Score 0.6
Cross-validation mean 0.75
Cross-validation std 0.055328
accuracy 0.60 10
macro avg 0.78 0.60 0.52 10
weighted avg 0.78 0.60 0.52 10
SVM_rbf
Rescaling Method Quantile-Gaussian
Missing Method None
Extraction Method None
Accuracy 0.6
Precision 0.6
Recall 0.6
F1 Score 0.6
Cross-validation mean 0.75
Cross-validation std 0.055328
accuracy 0.50 10
macro avg 0.25 0.50 0.33 10
weighted avg 0.25 0.50 0.33 10
SVM_rbf
Rescaling Method Quantile-Uniform
Missing Method None
Extraction Method None
Accuracy 0.5
Precision 0.5
Recall 0.5
F1 Score 0.5
Cross-validation mean 0.721429
Cross-validation std 0.014286
We have explored a few techniques for feature scaling, but there are many more, and there is no obvious answer
as to which is the best feature scaling method; depending on the context, it is important to explore different techniques
and parameters. What is certain is that a machine learning model does not know the difference between the
weight of a basket of strawberries and its price. This would be a "no-brainer" for humans, but the model simply
sees numbers: if there is a vast difference in range, it implicitly assumes that higher-ranging numbers carry more weight.
This is particularly true for machine learning algorithms that calculate distances between data points. In addition, some
algorithms converge faster with feature scaling than without it, as is the case for gradient descent in neural networks.
There are also algorithms that do not really require feature scaling; most of these, such as
tree-based algorithms (CART, random forests, gradient-boosted decision trees, etc.), rely on rules and series of inequalities.
If the standard deviation is small and the distribution is not Gaussian, MinMaxScaler responds well, but it is sensitive to
outliers. MaxAbsScaler is similar to MinMaxScaler, with the difference that the values are mapped in the range [−1, 1]; it
also suffers from large outliers. The StandardScaler is likewise very sensitive to outliers, and if the data are not normally
distributed it is clearly not the best scaler. RobustScaler and the quantile transformation are more robust to outliers.
The power transform is a family of parametric, monotonic transformations that can be applied to make data more
Gaussian-like, stabilizing variance and minimizing skewness through maximum likelihood estimation.
In machine learning, we need to process different types of data. Some of these types are continuous variables, and others are
categorical variables. In a way, we can compare the difference between continuous and categorical data with regression and
classification algorithms, at least for data inputs. As we are using data, it is critically important to consider and process the
categorical data correctly to avoid any incorrect impact on the performance of the machine learning models. We do not
really have a choice here, as in any case we need to transform categorical data, often text, into numeric and usable data
for calculation. Most of the time, we encounter three major classes of categorical data: binary, nominal, and ordinal
(Figure 2.5).
The nominal categorical data attribute indicates that there is no concept of ordering among their values, such as types of
video games (simulation and sport, action-adventure, real-time strategy, shooters, multiplayer online battle arena, etc.) or a
“pet” variable that can have the values of “dog,” “cat,” and “bird.” Ordinal categorical attributes reflect a notion of order
among their values. For example, we can think of the size of T-shirts (small, medium, large) or a podium result (first, second,
third). Both nominal and ordinal categories are often called labels or classes. These discrete values can be text, numbers, or
even images depending on the context. For instance, we can convert a numerical variable to an ordinal variable by dividing
its range into bins and assigning values to each bin; we call this process discretization. A numerical variable between 1 and
50 can be divided into an ordinal variable with five labels (1–10, 11–20, 21–30, 31–40, 41–50). Some algorithms, such as a
decision tree, can work with categorical data directly with no specific transformations. Most algorithms and tools need to
operate on numeric data (for more efficient implementation rather than as hard limitations on the algorithms themselves)
and not label data directly. For instance, if we use scikit-learn, all variables should be in numerical form.
It is preferable to apply continuous and categorical transformations after splitting the data between training and testing, as
illustrated in Figure 2.6. We can choose different encoders for different features. The output, which is encoded data, can be
merged with the rescaled continuous data to train the models. The same process is applied to the test data before applying
the trained models.
Many methods are available to encode categorical variables to feed models. In this chapter, we will investigate the
following well-known encoding methods:
• Ordinal encoding
• One-hot encoding
• Label encoding
• Helmert encoding
• Binary encoding
[Figure 2.6: After the train/test split, continuous features are rescaled and categorical features are encoded (encoders 1-3); the encoded training data feed five cross-validation folds and models 1-5, and the same rescaling and encoders are applied to the test data, whose predictions 1-5 are combined into a final prediction.]
• Frequency encoding
• Mean or target encoding
• Sum encoding
• Weight of evidence encoding
• James-Stein encoding
• M-estimator encoding
Different coding systems exist for categorical variables, including classic encoders that are well known and widely used
(ordinal, one-hot, binary, frequency, hashing), the contrast encoders that encode data by examining different categories
(or levels) of features, such as Helmert or backward difference, and Bayesian encoders, which use the target as a foundation
for encoding. Target, leave-one-out, weight of evidence, James-Stein, and M-estimator are Bayesian encoders. Even though
we already have a good list of encoders to explore, there are many more! It is important to master a couple of them and then
explore further.
Libraries to code these methods, such as scikit-learn or pandas, are available in the open-source world. In addition,
category_encoders is a very useful library for encoding categorical columns:
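Input:
# Install once from a terminal (pip install category_encoders), then import the library
import category_encoders as ce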
Depending on the context, some methods are more suitable than others, as shown in Figure 2.7. We will explain and code
each of them.
All code examples for this section can be found in hephaistos/Notebooks/Categorical_transform.ipynb.
[Figure 2.7: guide for choosing a categorical encoding method, starting from the type of categorical data.]
import os
import numpy as np
import pandas as pd
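# A small example DataFrame (a sketch, reconstructed from the output shown below)
df = pd.DataFrame({'Size': ['small', 'small', 'large', 'medium', 'large', 'large', 'small', 'medium'],
                   'Color': ['red', 'green', 'black', 'white', 'blue', 'red', 'green', 'black'],
                   'Class': [1, 1, 1, 0, 1, 0, 0, 1]})
print(df)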
Output:
Size Color Class
0 small red 1
1 small green 1
2 large black 1
3 medium white 0
4 large blue 1
5 large red 0
6 small green 0
7 medium black 1
The ordinal encoding transformation is also available in scikit-learn via the OrdinalEncoder class, which by default
assigns integers to the labels in lexicographically sorted order. If we need to specify a desired order, we can use
the "categories" argument with the rank order of all expected labels.
Let us encode the “Size” and “Color” features.
Input:
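# A possible encoding step (assumption), consistent with the output shown below:
# OrdinalEncoder assigns an integer to each sorted category of "Size" and "Color"
from sklearn.preprocessing import OrdinalEncoder
encoder = OrdinalEncoder()
df[["Size", "Color"]] = encoder.fit_transform(df[["Size", "Color"]])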
# Display Dataframe
print(df )
Output:
Size Color Class
0 2.0 3.0 1
1 2.0 2.0 1
2 0.0 0.0 1
3 1.0 4.0 0
4 0.0 1.0 1
5 0.0 3.0 0
6 2.0 2.0 0
7 1.0 0.0 1
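One-hot encoding creates one binary column per category. With pandas, we can one-hot encode the "Size" and "Color" columns as follows (a minimal sketch, reconstructed to match the output below and assuming df still holds the original text labels):
Input:
df = pd.get_dummies(df, columns=['Size', 'Color'], prefix='One')
print(df)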
Output:
Class One_large One_medium One_small One_black One_blue One_green One_red One_white
0 1 0 0 1 0 0 0 1 0
1 1 0 0 1 0 0 1 0 0
2 1 1 0 0 1 0 0 0 0
3 0 0 1 0 0 0 0 0 1
4 1 1 0 0 0 1 0 0 0
5 0 1 0 0 0 0 0 1 0
6 0 0 0 1 0 0 1 0 0
7 1 0 1 0 1 0 0 0 0
We can obtain a similar result with scikit-learn's OneHotEncoder:
Input:
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder(handle_unknown='ignore')
enc_df = pd.DataFrame(enc.fit_transform(df[['Size','Color']]).toarray())
df = df.join(enc_df )
print(df )
[Figure: one-hot encoding illustration - Blue → (1, 0, 0), Red → (0, 1, 0), Green → (0, 0, 1).]
Output:
Size Color Class 0 1 2 3 4 5 6 7
0 small red 1 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0
1 small green 1 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0
2 large black 1 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
3 medium white 0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0
4 large blue 1 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0
5 large red 0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
6 small green 0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0
7 medium black 1 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0
Here, we have applied one hot encoding to the “Size” and “Color” variables. The “Size” feature has an ordinality, which
means that applying one-hot encoding is not appropriate. In addition, if we turn [red, green, black, white, blue, red, green,
black] into [1, 2, 3, 4, 5, 1, 2, 3], we impose ordinality on a variable that is not ordinal. There are algorithms such as decision
trees that can handle categorical variables well, meaning that we can optimize memory; however, for other types of algo-
rithms, one-hot encoding can make a major difference. One-hot encoding has the advantage that the result is binary and not
ordinal, meaning that we are in an orthogonal vector space. One way to address dimensionality is to use principal compo-
nent analysis (PCA) as well, just after application of one-hot encoding.
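Label encoding simply maps each category to an integer. A minimal sketch with scikit-learn, consistent with the output below and assuming df still holds the original text labels:
Input:
from sklearn.preprocessing import LabelEncoder
df['Size'] = LabelEncoder().fit_transform(df['Size'])
df['Color'] = LabelEncoder().fit_transform(df['Color'])
print(df)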
Output:
Size Color Class
0 2 3 1
1 2 2 1
2 0 0 1
3 1 4 0
4 0 1 1
5 0 3 0
6 2 2 0
7 1 0 1
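The next encoders are illustrated with a new example containing an ordered "Size" feature. A minimal way to create it (a sketch, reconstructed from the output below):
Input:
df = pd.DataFrame({'Size': ['small', 'small', 'small', 'small', 'medium', 'medium',
                            'medium', 'large', 'large', 'x-large']})
print(df)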
Output:
Size
0 small
1 small
2 small
3 small
4 medium
5 medium
6 medium
7 large
8 large
9 x-large
The Helmert encoding method compares the mean of the dependent variable for “small” with the mean of all of the
subsequent levels of the categorical column (“medium,” “large,” “x-large”), the mean of the dependent variable for
“medium” with the mean of all of the subsequent levels (“large,” “x-large”), and the mean of the dependent variable
for “large” with the mean of all of the subsequent levels (in our case only one level, “x-large”).
Helmert encoding can be implemented with the category_encoders library. We must first import the category_encoders
library after installing it. We invoke the HelmertEncoder function and call the .fit_transform() method on it with the
DataFrame as the argument.
Input:
import category_encoders as ce
enc = ce.HelmertEncoder()
df = enc.fit_transform(df['Size'])
print(df )
Output:
intercept Size_0 Size_1 Size_2
We can ignore the intercept (columns with zero variance) by adding the drop_invariant = True option to ce.Helmert-
Encoder(). We can use Helmert encoding when levels of the categorical variable are ordered (smallest to largest, for
instance).
Input:
import category_encoders as ce
enc = ce.BinaryEncoder(cols=['Color','Size'])
df_binary = enc.fit_transform(df )
df_binary
print(df_binary)
Output:
Size_0 Size_1 Color_0 Color_1 Color_2 Class
0 0 1 0 0 1 1
1 0 1 0 1 0 1
2 1 0 0 1 1 1
3 1 1 1 0 0 0
4 1 0 1 0 1 1
5 1 0 0 0 1 0
6 0 1 0 1 0 0
7 1 1 0 1 1 1
Frequency encoding replaces each category with its frequency of occurrence in the dataset:
Input:
frequency = df.groupby('Color').size()/len(df)
df.loc[:,'Frequency'] = df['Color'].map(frequency)
print(df )
Output:
Size Color Class Frequency
Output:
Size Color Target
0 small red 1
1 small green 1
2 large black 1
3 medium white 0
4 large blue 1
5 large red 0
6 small green 0
7 medium black 1
In this method, we encode, for each unique value of the categorical feature, based on the ratio of occurrence of the positive
class in the target variable. For the feature “Color,” the value “Red” has two occurrences in the target variable, and one of
those is the positive label. Mean encoding would be 0.5 for the value “Red” (Figure 2.10).
[Figure 2.10: mean (target) encoding of "Color" - the sum of the target per color (Red 1, Green 1, Black 2, White 0) divided by its number of occurrences gives the Mean-encoding column (e.g., Red → 0.5, Black → 1.0).]
Input:
mean_encoding = df.groupby('Color')['Target'].mean()
df.loc[:,'Mean_encoding'] = df['Color'].map(mean_encoding)
print(df )
Output:
Size Color Target Mean_encoding
The volume of the data is not affected by this method, and it can help in faster learning. It provides more logic to the
data in comparison with simple labeling, as we have a probability of our target variable that is conditional on each value of
the feature. A well-known issue with mean encoding is overfitting. We need to increase regularization with cross-
validation and add random noise to the representation of the category in the dataset. Mean encoding is particularly useful
with gradient boosting trees as it decreases cardinality and reaches better loss with a shorter tree, thus improving the
classification.
Sum encoding can also be implemented with the category_encoders library.
Input:
from category_encoders import SumEncoder
sum_encoder = SumEncoder()
df_encoded = sum_encoder.fit_transform(df['Size'], df['Target'])
print(df_encoded)
Output:
intercept Size_0 Size_1
0 1 1.0 0.0
1 1 1.0 0.0
2 1 0.0 1.0
3 1 –1.0 –1.0
4 1 0.0 1.0
5 1 0.0 1.0
6 1 1.0 0.0
7 1 –1.0 –1.0
8 1 1.0 0.0
9 1 –1.0 –1.0
nominator = (n⁺ + a) / (y⁺ + 2a)
denominator = (n − n⁺ + a) / (y − y⁺ + 2a)
x_k = ln(nominator / denominator)
The variables y and y+ are the total number of observations and the total number of positive observations, respectively
( y = 1); n and n+ are the number of observations and the number of positive observations, respectively ( y = 1) for a
given value of a categorical column; a is the regularization hyperparameter (selected by a user).
We could replace goods and bads with events (bads) and non-events (goods). Distribution refers to the proportion of goods
or bads in the respective group relative to the column totals. The value of WoE will be 0 if the odds P[Distribution Goods]/P
[Distribution Bads] is equal to 1. In a group, if P[Distribution Goods] < P[Distribution Bads], the odds ratio will be less than
1 and the WoE will be less than 0. On the contrary, if P[Distribution Goods] > P[Distribution Bads], WoE will be a positive
number.
Alternatively, let us say that we have a dataset of people suffering from disease A and that we would like to calculate the
relationship between being male and the possibility of having disease A:
Conventionally, we also calculate the information value (IV), which measures the importance of a feature:
IV = (0.43 − 0.41) × 0.048 = 0.001, which means that gender is not a good predictor for disease A according to the
following table:
In the context of machine learning, WoE is also used for the replacement of categorical values. With one-hot encoding, if
we assume that a column contains five unique labels, there will be five new columns. In such a case, we can replace the
values with the WoE. This method is particularly well suited for subsequent modeling using logistic regression. WoE trans-
formation orders the categories on a “logistic” scale, which is natural for logistic regression.
If we use the category_encoders library, the code will look similar to the following:
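Input:
import category_encoders as ce
# A sketch: encode "Size" by its weight of evidence with respect to the binary target
# (the regularization parameter may need adjusting to reproduce the exact values below)
encoder = ce.WOEEncoder(cols=['Size'])
df_encoded = encoder.fit_transform(df['Size'], df['Target'])
print(df_encoded)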
It provides the following output by replacing the labels in “Size” by the WoE:
Size
0 0.000000
1 0.000000
2 0.693147
3 –0.693147
4 0.693147
5 0.693147
6 0.000000
7 –0.693147
8 0.000000
9 –0.693147
One of the drawbacks of WoE is the possible loss of information due to binning into relatively few categories. In addi-
tion, being a univariate measure, it does not consider correlation between independent variables; WoE assumes no inter-
actions, which means that the same relationship should hold across the spectrum of values. It might also lead to
target leakage (and overfitting). On the other hand, it transforms an independent variable so that it has a monotonic
relationship with the dependent variable, and it is advantageous for variables with many sparsely populated discrete
values.
Output:
Size Color Target Proba_Ratio
The Inf values are due to a division by zero. To avoid this situation, we can replace a zero denominator with a small value, or add a smoothing constant to both the numerator and the denominator.
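A minimal sketch of probability ratio encoding (assumption: the "Color" feature is encoded, and a small constant protects the denominator):
Input:
prob_positive = df.groupby('Color')['Target'].mean()
prob_negative = (1 - prob_positive).clip(lower=1e-6)
df['Proba_Ratio'] = df['Color'].map(prob_positive / prob_negative)
print(df)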
If we desire four binary features, we can write the hash output in binary and keep the last four bits. For example,
hash("Elsa") = 18, and 18 in binary is 10010, whose last four bits give us the values 0, 0, 1, 0.
To implement feature hashing in Python, we can use the category_encoders library. Below, we transform the feature “Size”
by selecting three bits in our hash value.
Input:
import category_encoders as ce
# n_components contain the number of bits you want in your hash value.
encoder_purpose = ce.HashingEncoder(n_components=3)
df_encoded = encoder_purpose.fit_transform(df['Size'])
print(df_encoded)
Output:
col_0 col_1 col_2
0 0 1 0
1 0 1 0
2 1 0 0
3 1 0 0
4 1 0 0
5 1 0 0
6 0 1 0
7 1 0 0
8 0 1 0
9 1 0 0
We can use different hashing methods using the hash_method option. Any method from hashlib will function (import
hashlib):
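Input:
# A sketch (assumption): switch the hashing function through the hash_method option
encoder_sha = ce.HashingEncoder(n_components=3, hash_method='sha256')
df_encoded = encoder_sha.fit_transform(df['Size'])
print(df_encoded)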
Hashing encoding is well suited for categorical variables with a large number of levels and scales very well with low mem-
ory usage.
To implement backward difference encoding in Python, we can use the category_encoders library. Below, we transform
the feature “Size.”
Input:
import category_encoders as ce
encoder = ce.BackwardDifferenceEncoder(cols=['Size'])
df_encoded = encoder.fit_transform(df['Size'])
print(df_encoded)
Output:
intercept Size_0 Size_1
0 1 –0.666667 –0.333333
1 1 –0.666667 –0.333333
2 1 0.333333 –0.333333
3 1 0.333333 0.666667
4 1 0.333333 –0.333333
5 1 0.333333 –0.333333
6 1 –0.666667 –0.333333
7 1 0.333333 0.666667
8 1 –0.666667 –0.333333
9 1 0.333333 0.666667
x_k^i = ( Σ_{j≠i} y_j · (x_j = k) ) / ( Σ_{j≠i} (x_j = k) )
The variables x_i and y_i are the ith value of the categorical feature and of the target, respectively. The formula computes
the mean target of category k when observation i is removed from the dataset.
For the test dataset, a category is replaced with the mean target of the category k in the training dataset:
x_k = ( Σ_j y_j · (x_j = k) ) / ( Σ_j (x_j = k) )
As usual, implementation can be performed with the category_encoders library. An example is shown below for the
“Color” feature.
Input:
import category_encoders as ce
encoder = ce.LeaveOneOutEncoder(cols=['Color'])
df_encoded = encoder.fit_transform(df['Color'], df['Target'])
print(df_encoded)
Output:
Color
0 0.333333
1 0.000000
2 1.000000
3 0.500000
4 0.500000
5 0.666667
6 0.000000
7 1.000000
8 0.333333
9 0.666667
an overfit or underfit situation. A large value of B will result in a larger weight of the global mean (underfit), while a low
value of B will result in a larger weight of the condition mean (overfit).
A way to select B is to tune it like a hyperparameter. Charles Stein devised the following solution:
B = var(y_k) / ( var(y_k) + var(y) )
If we cannot rely on the estimate of a category's mean target (y_k has a high variance), we need to put more weight
on mean(y), the global mean. We also need to assume that the variance is the same for all categories and equal to the
global variance of y (which might be a good estimate if we do not have too many unique categorical values); this is
called the pooled variance, or pooled model. We could also use an independent model by replacing the variances with
squared standard errors, which penalizes small observation counts. In addition, the fact that the estimator assumes a
normal distribution is a serious limitation for classification tasks. A possible solution is to use a beta distribution
or to convert binary targets with log odds.
Input:
import category_encoders as ce
encoder = ce.JamesSteinEncoder(cols=['Color'])
df_encoded = encoder.fit_transform(df['Color'], df['Target'])
print(df_encoded)
Output:
Color
0 0.5
1 0.0
2 1.0
3 0.0
4 1.0
5 0.5
6 0.0
7 1.0
8 0.5
9 0.5
x_k = ( n⁺ + prior × m ) / ( y⁺ + m )
where y+ [count(category)] is the total number of positive observations (y = 1), n+ is the number of positive observations
(y = 1) for a given value of a categorical column [count(category) × mean(category)], and prior [mean(target)] is an average
value of target.
The implementation can be performed using the category_encoders library (with an example shown below for the “Color”
feature).
Input:
encoder = ce.MEstimateEncoder(cols=['Color'])
df_encoded = encoder.fit_transform(df['Color'], df['Target'])
print(df_encoded)
Output:
Color
0 0.500000
1 0.166667
2 0.833333
3 0.250000
4 0.750000
5 0.500000
6 0.166667
7 0.833333
8 0.500000
9 0.500000
We have included in the pipeline the row removal method to handle missing values. We go to the hephAIstos folder,
create a new Python file, and code the pipeline as shown below.
Input:
# Run ML Pipeline
ml_pipeline_function(df, output_folder = './Outputs/', missing_method =
'row_removal', test_size = 0.2, categorical = ['binary_encoding','label_encoding'],
features_binary = ['smoker','sex'], features_label = ['region'])
As we have seen, feature engineering requires the use of different methods to rescale continuous data and encode cat-
egorical data. If we add a time component, feature engineering can be more complex to understand, yet time-related
features are found in many fields such as finance, weather forecasting, and healthcare. A time series is a sequence of
numerical values representing the evolution of a specific quantity over time. The data are captured at equal intervals.
Time series are useful to understand the behavior of a variable in the past and use this knowledge to predict the future
behavior of the variable through probabilistic and statistical concepts. With the time variable, we can predict a stock price
based on what happened yesterday, predict the evolution of a disease based on past experiences, or predict the road traffic
in a city if we have data from the last few years. Time series may also reflect seasonality or trends that we can model
mathematically.
To study some techniques for handling time-related features, we will use a dataset used to train models on weather
forecasting in the Indian climate. This dataset provides data from 2013 to 2017 in the city of Delhi, India, with four
parameters: meantemp, humidity, wind_speed, and meanpressure.
The Jupyter Notebook for the examples below is located in hephaistos/Notebooks/Time_related_transformation.ipynb.
Let us load some data to illustrate time-related features engineering.
Input:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Load data
csv_data = '../data/datasets/DailyDelhiClimateTrain.csv'
df = pd.read_csv(csv_data, delimiter=',')
print(df.head())
Output:
date meantemp humidity wind_speed meanpressure
Input:
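# A sketch (assumption): plot the mean temperature over time
df['date'] = pd.to_datetime(df['date'])
plt.plot(df['date'], df['meantemp'])
plt.xlabel('Date')
plt.ylabel('Meantemp')
plt.show()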
Output:
[Figure: meantemp plotted against date for the Delhi climate dataset.]
In this chapter, we will examine some scenarios that involve time-related data.
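We can split the date into calendar components (a minimal sketch using the pandas dt accessor; the exact code used here is not shown):
Input:
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
print(df.head())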
Output:
date meantemp humidity wind_speed meanpressure year month day
We have now loaded the dataset with pandas and have created a DataFrame with new columns (year, month, and
day) for each observation in the series. Of course, we can adjust the time variables to more than just year, month, or day
alone. We can combine time information with other features, such as the season of the month or semester, to improve
the performance of our models. For instance, if our time stamp has hours and we want to study the road traffic, we can
add variables such as business hours and non-business hours or the name of the day in the week. DatetimeIndex from
pandas provides many attributes.
df['lag_1'] = df['meantemp'].shift(1)
df = df[['date', 'lag_1', 'meantemp']]
print(df.head())
Output:
date lag_1 meantemp
In this example, we have generated a lag 1 feature for our variable meantemp. We do not really have a justification to
do it here, but if we have data in which we wish to identify a weekly trend, for instance, we can create lag features for
one week (Monday to Monday). We also can create multiple lag features (the sliding window approach). Let us say we
desire lag 1 (t − 1) to lag 5 (t − 5).
Input:
df['lag_1'] = df['meantemp'].shift(1)
df['lag_2'] = df['meantemp'].shift(2)
df['lag_3'] = df['meantemp'].shift(3)
df['lag_4'] = df['meantemp'].shift(4)
df['lag_5'] = df['meantemp'].shift(5)
Output:
date lag_1 lag_2 lag_3 lag_4 lag_5 meantemp
As we can see, the DataFrame contains “NaN,” which means Not a Number. We should discard the first rows of the data
to train the models. We can perform a sensitivity analysis and use different numbers of lags or use lag values from the last
month or last year. If we train a linear regression model, the model will assign weights to the lag features. We can also use an
autocorrelation function (ACF), which measures the correlation between the time series and the lagged version of itself, and
a partial autocorrelation function (PACF), which measures the correlation between the time series with a lagged version of
itself after the elimination of the variations already explained by the intervening comparisons to determine the lag at which
the correlation is significant.
Input:
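# A sketch (assumption): PACF and ACF of meantemp for the first ten lags, using statsmodels
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
plot_pacf(df['meantemp'], lags=10)
plot_acf(df['meantemp'], lags=10)
plt.show()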
Output:
[Figure: partial autocorrelation (top) and autocorrelation (bottom) of meantemp for lags 0 to 10.]
The partial autocorrelation shows a high correlation with the first and second lag. The ACF shows a slow decrease, mean-
ing that the future values have a very high correlation with past values. As we have seen, lag features are used to understand
the behavior of a target value relative to the past, which may be a day, a week, or a month before. However, we must pay
attention to the use of lag features, which can lead to overfitting if not used properly.
Aggregating features through statistics such as average, standard deviation, maximum, minimum, or skewness might be
valuable additions to predict future behavior. Pandas provides the aggregate method to perform these calculations.
Input:
# Load data
csv_data = './data/DailyDelhiClimateTrain.csv'
df = pd.read_csv(csv_data, delimiter=',')
df['lag_1'] = df['meantemp'].shift(1)
df['lag_2'] = df['meantemp'].shift(2)
df['lag_3'] = df['meantemp'].shift(3)
df['lag_4'] = df['meantemp'].shift(4)
df['lag_5'] = df['meantemp'].shift(5)
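# A sketch (assumption): aggregate the five lag features row by row, which matches the
# "window size = 5" statistics shown in the output below
lag_cols = ['lag_1', 'lag_2', 'lag_3', 'lag_4', 'lag_5']
df['max'] = df[lag_cols].max(axis=1)
df['min'] = df[lag_cols].min(axis=1)
df['mean'] = df[lag_cols].mean(axis=1)
df['Standard Deviation'] = df[lag_cols].std(axis=1)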
print(df.head(10))
Output:
[Output, abridged: for each date, the DataFrame now contains meantemp, humidity, wind_speed, meanpressure, the lag features lag_1 to lag_5, and the aggregated max, min, mean, and standard deviation computed over a window of size 5. For example, on 2013-01-06 (meantemp 7.0) the lags are 6.0, 8.67, 7.17, 7.4, and 10.0, giving max 10.0, min 6.0, mean 7.85, and standard deviation 1.53.]
To implement this method, we need to define the feature derivation window, which is a rolling window relative to a
forecast point, and a forecast window, which is a range of the future values we want to predict.
[Figure: along the time axis, the feature derivation window precedes the forecast point, which is followed by the forecast window.]
We can calculate the mean of the previous seven values (meantemp in our data) and use that to predict the next value.
Input:
df['rolling_window_mean'] = df['meantemp'].rolling(window=7).mean()
df = df[['date', 'rolling_window_mean', 'meantemp']]
print(df.head(20))
Output:
date rolling_window_mean meantemp
The use of the mean is not required, as we can also consider other metrics such as the sum, minimum value, or maximum
value as features for the selected window. It is also possible to adjust data by using weights that can depend on the time of
observation (e.g., giving higher weights to recent values).
We can also use an expanding window, which includes all previous values up to each point rather than a fixed-size window.
Input:
# Load data
csv_data = './data/DailyDelhiClimateTrain.csv'
df = pd.read_csv(csv_data, delimiter=',')
df.head()
df['expanding_mean'] = df['meantemp'].expanding(7).mean()
df = df[['date','meantemp', 'expanding_mean']]
print(df.head(20))
Output:
date meantemp expanding_mean
Another common technique is differencing, i.e., subtracting the value observed at a previous time step, which helps remove trends and make a series more stationary.
Input:
# Load data
csv_data = './data/DailyDelhiClimateTrain.csv'
df = pd.read_csv(csv_data, delimiter=',')
# 1 month difference
df['1month_diff'] = df['meantemp'].diff(periods=1)
# 24 months difference
df['24month_diff'] = df['meantemp'].diff(periods=24)
print(df.head(10))
Output:
date meantemp humidity wind-speed meanpressure 1month_diff 24month_diff
Input:
df['1month_diff'] = df['meantemp'].diff(periods=1)
plt.plot(df['1month_diff'])
plt.show()
df['24month_diff'] = df['meantemp'].diff(periods=24)
plt.plot(df['24month_diff'])
plt.show()
Output:
[Figure: the one-period difference and the 24-period difference of meantemp plotted over the series.]
Applying square root, power, or log transformations to create a more normal distribution is also a possibility. If the
data are stationary, which means that mean, variance, and autocorrelation structure do not change over time, we can
consider each point as having been drawn from a normal distribution. We can use the square root to transform our
dataset into a linear trend if the data increase quadratically. If the dataset grows exponentially, log (natural) transfor-
mation will make it linear.
Input:
# Load data
csv_data = './data/DailyDelhiClimateTrain.csv'
df = pd.read_csv(csv_data, delimiter=',')
plt.plot(df['meantemp'])
plt.title("Original")
plt.show()
df['sqrt'] = np.sqrt(df['meantemp'])
plt.plot(df['sqrt'])
plt.title("sqrt")
plt.show()
df['log'] = np.log(df['meantemp'])
plt.plot(df['log'])
plt.title("log")
plt.show()
Output:
[Figure: the meantemp series in its original form, after square root transformation, and after log transformation.]
The transformations do not really change the shape of the data because the data points increase neither quadratically nor
exponentially.
The techniques above allow us to convert our time series problem into a supervised machine learning problem. How-
ever, we need to pay attention when creating the validation and test sets for time series. Usually, the validation and test
subsets are randomly selected. In time series, a data point is dependent on past values, meaning that if we select our
subsets randomly, we might train our model on future data and predict past values (look-ahead bias). We should split
the time variable to avoid this situation. It is recommended to select training, validation, and testing procedures in that
exact order according to time, with the test set using the most recent data. Recurrent neural networks are the most
traditional and accepted algorithms for problems based on time series forecasting.
To transform time series with hephAIstos, we need to provide inputs to the following parameters:
• time_transformation: To transform time series data, we can use different techniques such as lag, rolling window, or
expending window. For example, to use lag, we need to set the time_transformation as follows: time_transformation
= “lag.”
• If rolling_window is selected:
– window_size: An integer indicating the size of the window
– rolling_features: The features for which to apply rolling window.
• If expending_window is selected:
– expending_window_size: An integer indicating the size of the window
– expending_features: To select the features we want to apply expending rolling window
We can go to hephAIstos main folder and create a new Python file to insert the code below.
Input:
# Run ML Pipeline
ml_pipeline_function(df, output_folder = './Outputs/', missing_method = 'row_removal',
test_time_size = 365, time_feature_name = 'date', time_format = "%Y-%m-%d",
time_split = ['year','month','day'], time_transformation='lag',number_of_lags=2,
lagged_features = ['wind_speed', 'meanpressure'], lag_aggregation
= ['min', 'mean'])
The issue of missing data in machine learning has been largely overlooked, but it affects data analysis across a wide range of
domains. The handling of missing values is also a key step in the preprocessing of a dataset, as many machine learning
algorithms do not support missing values. In addition, making the correct decision regarding missing values can generate
robust data models. Missing values are due to different causes such as incomplete extraction, data corruption, or issues in
loading the dataset.
There is a large set of methods to address missing values, ranging from simple ones such as deleting the rows containing
missing values or imputing them for both continuous and categorical variables to more complex ones such as the use of
machine and deep learning algorithms to impute missing values.
Let us create a class called missing in a file named missing.py to address missing values.
Input:
if j:
print("{} : {:.2f} %".format(i, (j/total_samples)*100))
null_column_list.append(i)
else:
print("None of the columns contains missing values !")
return null_column_list
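For reference, a self-contained sketch of such a check could look as follows (assuming the class stores the DataFrame in self.df; the method name is illustrative):
def check_missing_values(self):
    # Report the percentage of missing values per column and collect the affected columns
    null_column_list = []
    total_samples = self.df.shape[0]
    for column, count in self.df.isnull().sum().items():
        if count:
            print("{} : {:.2f} %".format(column, (count / total_samples) * 100))
            null_column_list.append(column)
    if not null_column_list:
        print("None of the columns contains missing values !")
    return null_column_list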
To remove rows or columns containing missing values, we can use pandas.DataFrame.dropna, which determines
whether rows or columns that contain missing values are removed; “axis = 0” drops rows whereas “axis = 1” drops columns.
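For example (a minimal sketch):
# Drop every row that contains at least one missing value
df_rows_removed = df.dropna(axis=0)
# Drop every column that contains at least one missing value
df_cols_removed = df.dropna(axis=1)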
Input:
if len(categorical_cols):
for i in categorical_cols:
df_stats_mean.fillna({i : self.df[i].mode()[0]}, inplace=True)
df_stats_median.fillna({i : self.df[i].mode()[0]}, inplace=True)
df_stats_mode.fillna({i : self.df[i].mode()[0]}, inplace=True)
df_linear_interpolation = self.df.copy()
# Linear interpolation for numeric values
print(f'Imputing following columns with linear interpolation :
{numeric_cols}')
if len(numeric_cols):
for i in numeric_cols:
df_linear_interpolation[numeric_cols] = df_linear_interpolation
[numeric_cols].interpolate(method='linear', limit_direction='forward', axis=0)
if len(categorical_cols):
for i in categorical_cols:
df_linear_interpolation[numeric_cols] = df_linear_interpolation
[numeric_cols].interpolate(method='linear', limit_direction='forward', axis=0)
return df_linear_interpolation
• Each multiple imputed dataset is analyzed, and the results are combined.
The use of multiple imputations, as opposed to single imputations, accounts for statistical uncertainty in the imputations
(Figure 2.11).
Multivariate imputation by chained equation (MICE), also known as fully conditional specification, sequential regression
multiple imputation, or Gibbs sampling, has emerged in the statistical literature as one appealing method to replace missing
values via multiple imputations. To make it concrete, let us imagine we have a dataset composed of the variables age, gen-
der, and income. Each of the variables has missing values. The MICE method assumes that these data points are missing at
random (MAR). This assumption considers that the probability of the missing data for a variable is related to other measured
variables but unrelated to the variable with the missing value itself. In our example, this means that the probability that the
income variable is missing depends on other observed values but not on unobserved values for income. In other words, the
income data values are missing perhaps because a certain gender is less likely to respond to a survey. The missing data points
are not related to the level of income itself. We also have other categories, such as missing completely at random (MCAR)
when the data are missing values completely at random, unrelated to any other variables (including the variable with miss-
ing value itself ). Values may be missing because we have lost information during transportation. There is also the missing
not at random (MNAR) category, which indicates that the missing values for a variable are related to the variable with the
missing values itself.
The following steps would be an approach to apply MICE:
• The missing variables are imputed using, for instance, mean imputation (any missing value is replaced by the mean
observed value for that specific variable).
• The imputed mean values for age are set back to missing.
• A linear regression is run for age predicted by income and gender, using all cases in which age was observed.
• The results of this analysis are used to predict and impute the missing age values. Age should not have any missing
values at this stage.
• The previous three steps are repeated for the income variable. The linear regression is performed with income predicted
by age and gender. The same steps are performed for the gender variable with a logistic regression of gender on age and
income.
• The entire process is repeated until convergence. The final set of imputed values constitutes a complete dataset.
The properties of MICE make it particularly useful for large imputation procedures. The technique is highly flexible,
as it can process variables of varying types such as binary or continuous, as well as complexities such as bounds or
survey skip patterns. Depending on the tools we use, the implementation of MICE can vary; we can model each variable
according to its distribution with a multinomial logit model for categorical variables, a Poisson model for count vari-
ables, logistic regression for binary variables, or linear regression for continuous variables. The regression models use
information from all other variables. Residual error is added to create the imputed values and add sampling variability
to the imputations. Residual variance can also be added to the parameter estimates of the regression.
As an example, we could use the fancyimpute library to apply the MICE method on continuous variables.
Input:
def mice(self ):
from fancyimpute import IterativeImputer
mice_imputer = IterativeImputer()
df_mice = mice_imputer.fit_transform(self.df )
return df_mice
The k-nearest neighbors (KNN) approach imputes a missing entry from the completed values of neighboring observations.
For example, we can use the mean value from the nearest k neighbors (n_neighbors) in the dataset. This method can be
used for continuous, discrete, and categorical data. To implement it, we can use either fancyimpute or scikit-learn.
Let us continue our class missing, created at the beginning of the chapter.
Input:
# using KNN
def knn(self ):
print()
print('Using KNN imputation algorithm...')
from sklearn.impute import KNNImputer
KNN_imputer = KNNImputer(n_neighbors=5)
df_knn = self.df.copy(deep=True)
df_knn.iloc[:, :] = KNN_imputer.fit_transform(df_knn)
return df_knn
We can use many methods, including machine learning algorithms such as XGBoost, LightGBM, or random forests, to
address missing values. We need to explore what is the most appropriate for our context.
To finish our class missing, we can add the following lines to our code:
if inputs.selected_method[0] == 'row_removal':
# applying row removal
try:
df_row = self.row_removal()
print(df_row)
return df_row
except:
print("Something went wrong with df_row: Please Check")
if inputs.selected_method[0] == 'column_removal':
# applying column removal
try:
df_col = self.column_removal()
print(df_col.shape)
return df_col
except:
print("Something went wrong with df_col: Please Check")
if inputs.selected_method[0] == 'stats_imputation_mean':
# applying statistical imputation Mean + Mode
try:
df_stats_mean = self.stats_imputation_mean(null_column_list)
return df_stats_mean
except:
print("Something went wrong with stats_imputation_mean: Please Check")
if inputs.selected_method[0] == 'stats_imputation_median':
# applying statistical imputation Median + Mode
try:
df_stats_median = self.stats_imputation_median(null_column_list)
return df_stats_median
except:
print("Something went wrong with stats_imputation_median: Please
Check")
if inputs.selected_method[0] == 'stats_imputation_mode':
# applying statistical imputation Median + Mode
try:
df_stats_mode = self.stats_imputation_mode(null_column_list)
return df_stats_mode
except:
print("Something went wrong with stats_imputation_mode: Please Check")
if inputs.selected_method[0] == 'linear_interpolation':
# applying linear interpolation
try:
df_interpolation_linear = self.linear_interpolation(null_column_list)
return df_interpolation_linear
except:
print("Something went wrong with interpolation_linear: Please Check")
if inputs.selected_method[0] == 'mice':
# applying MICE
try:
df_mice = self.mice()
return df_mice
except:
print("Something went wrong with MICE: Please Check")
if inputs.selected_method[0] == 'knn':
# applying KNN
try:
df_knn = self.knn()
return df_knn
except:
print("Something went wrong with KNN: Please Check")
# Load data
input_file = './data/missing.csv'
df = pd.read_csv(input_file, delimiter=';')
"""
Selected_method: "row_removal", "column_removal", "stats_imputation_mean",
"stats_imputation_median", "stats_imputation_mode", "linear_interpolation", "mice",
"knn"
"""
selected_method = ['knn']
import os
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import inputs
from missing import missing
df = inputs.df
m = missing(X_train, X_train)
X_train = m.missing_main()
print(X_train)
To address missing values in the sample with hephAIstos, we can go to the hephAIstos main folder, create a new
Python file, and insert the code example below for a new pipeline. The following options are necessary:
Input:
# Import dataset
from data.datasets import neurons
df = neurons()
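# Run ML Pipeline (a sketch: 'knn' is one possible value for missing_method; the other
# available values are those listed in the missing class above)
ml_pipeline_function(df, output_folder = './Outputs/', missing_method = 'knn',
    test_size = 0.2)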
In feature extraction, the objective is to reduce an initial set of raw data into more manageable groups for running our
machine learning models. We speak about dimension reduction in this context. Feature selection is the process of identify-
ing and selecting the most relevant subset of input features to explain the target variable. It is generally accepted that gen-
erating features in a model can be beneficial, but we also need to take care to exclude irrelevant features. For instance, in the
case of weather data, the scores of the Portuguese or French soccer teams during the year are irrelevant for predicting the
temperature. Both feature selection and feature extraction are used to dimensionally reduce initial raw data, allowing reduc-
tion of the complexity and overfitting of a model. As a data scientist, it is critically important to master these techniques to
delete irrelevant features and reduce overfitting.
Autoencoders, which learn efficient data codings in an unsupervised way, are also a good application in which feature
extraction helps identify the important features to encode.
In this section, we will discuss some of the linear and nonlinear dimensionality reduction techniques that are widely
used in a variety of applications, including PCA, independent component analysis (ICA), linear discriminant analysis
(LDA), and locally linear embedding (LLE). Once we understand how these methods work, we can explore many more
methods such as canonical correlation analysis (CCA), singular value decomposition (SVD), CUR matrix decomposi-
tion, compact matrix decomposition (CMD), non-negative matrix factorization (NMF), kernel PCA, multidimensional
scaling (MDS), isomap, Laplacian Eigen map, local tangent space alignment (LTSA), or fast map.
All code examples provided to explain feature extraction can be found in hephaistos/Notebooks/Feature_extraction.ipynb
or https://github.com/xaviervasques/hephaistos/blob/main/Notebooks/Feature_extraction.ipynb.
From a terminal in hephaistos/Notebooks, we can open Jupyter Notebook (the command line is jupyter notebook) and
then open Feature_extraction.ipynb in a browser to run the different code examples.
The first step is to standardize the data:
z = (x − mean(x)) / standard deviation(x)
We then need to calculate the covariance matrix for the entire dataset. Let us first examine the difference between var-
iance and covariance. Variance is a measure of the variation of a single random variable, for example, the weight of a person
in a population. Covariance will measure how much two random variables vary together, for instance, the weight and the
height of a person in a population.
The variance is calculated by the following formula:
σ²_x = (1 / (n − 1)) · Σ_{i=1}^{n} (x_i − x̄)²
where x is the random variable, n is the number of samples, and x̄ is the mean.
The covariance is given by the following formula:
σ(x, y) = (1 / (n − 1)) · Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ)
For two variables x and y, the covariance matrix is:
C = [ σ(x, x)  σ(x, y)
      σ(y, x)  σ(y, y) ]
The diagonal entries are the variances, and the rest are the covariances. We then need to calculate the eigenvalues
and eigenvectors of the covariance matrix. The objective of this step is to identify the principal components that are
new, uncorrelated variables created from linear combinations of the initial variables. Most of the information is con-
centrated in the first components. Let us say we have 20 features; it will give us 20 principal components. The maximum
information is in the first component (the largest possible variance), the maximum remaining information is in the
second, and so on.
Eigenvectors constitute a set of vectors whose direction remains unchanged when we apply a linear transformation, and
an eigenvalue is the factor by which the eigenvector is scaled. If v is an eigenvector of A and λ is the corresponding eigen-
value, we can write the following expression: Av = λv. If we put all eigenvalues in a diagonal matrix L and all eigenvectors in
a matrix V, we can state that CV = VL where C is the covariance matrix, represented as C = VLV−1.
The equation Av = λv can be stated equivalently as (A − λI)v = 0, where I is the identity matrix (n by n) and 0 is the zero
vector. Because v is a nonzero vector, the equation can be resolved if and only if the determinant of the matrix (A − λI) is
zero: |A − λI| = 0.
Calculating the determinant will allow identification of the eigenvalues. We can then resolve the equation (A − λI)v = 0 to
find the eigenvectors using the eigenvalues. The next steps we need to perform are sorting the eigenvalues and their cor-
responding eigenvectors, choosing k eigenvalues, and forming a new matrix of eigenvectors to transform our data (features
matrix × k eigenvectors).
For the purpose of this section, we will use the mushroom classification dataset coming from Kaggle (https://www.
kaggle.com/uciml/mushroom-classification). This dataset includes descriptions of hypothetical samples corresponding
to 23 species of gilled mushrooms in the Agaricus and Lepiota genera, drawn from The Audubon Society Field Guide to
North American Mushrooms (1981). Each species is identified as definitely edible, definitely poisonous, or of unknown
edibility and not recommended. This latter class has been combined with the poisonous one. The Guide clearly states
that there is no simple rule (such as “leaflets three, let it be” for poison oak and ivy) for determining the edibility of a
mushroom.
Now that we have seen an overview of the theory, let us code and use the mushroom data.
Input:
import pandas as pd
csv_data = '../data/datasets/mushrooms.csv'
df = pd.read_csv(csv_data, delimiter=',')
print(df )
Output:
[The DataFrame has 8124 rows × 23 columns - class, cap-shape, cap-surface, cap-color, bruises, odor, gill-attachment, gill-spacing, gill-size, gill-color, the stalk and veil attributes, ring-number, ring-type, spore-print-color, population, and habitat - each holding single-letter categorical codes.]
To proceed, we need to encode the categorical variables. We will use label encoding. One very important thing to do is to
perform feature rescaling, as PCA is highly affected by scale. Using StandardScaler will standardize the features into a unit
scale and limit the effects.
Input:
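# A sketch (assumption): label-encode every categorical column, then standardize the features
from sklearn.preprocessing import LabelEncoder, StandardScaler
df_encoded = df.apply(LabelEncoder().fit_transform)
X = df_encoded.drop('class', axis=1)
y = df_encoded['class']
X = StandardScaler().fit_transform(X)
X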
Output:
array( [ [ 1.02971224, 0.14012794, –0.19824983, ..., –0.67019486,
–0.5143892 , 2.03002809],
[ 1.02971224, 0.14012794, 1.76587407, ..., –0.2504706 ,
–1.31310821, –0.29572966],
[–2.08704716, 0.14012794, 1.37304929, ..., –0.2504706 ,
–1.31310821, 0.86714922],
...,
[–0.8403434 , 0.14012794, –0.19824983, ..., –1.50964337,
–2.11182722, 0.28570978],
[–0.21699152, 0.95327039, –0.19824983, ..., 1.42842641,
0.028432981, 0.28570978],
[ 1.02971224, 0.14012794, –0.19824983, ..., 0.16925365,
–2.11182722, 0.28570978]])
The original data have 22 columns that we will reduce to two dimensions with the following few lines, using scikit-learn:
Input:
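# A sketch (assumption): reduce the 22 standardized features to two principal components
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
principal_components = pca.fit_transform(X)
PCA_df = pd.DataFrame(principal_components,
                      columns=['principal component 1', 'principal component 2'])
PCA_df = pd.concat([PCA_df, y.reset_index(drop=True)], axis=1)
print(PCA_df.head())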
Output:
0 –0.574322 –0.975780 1
1 –2.282102 0.279064 0
2 –1.858036 –0.270974 0
3 –0.884780 –0.756470 1
4 0.689613 1.239266 0
targets = [1, 0]
colors = ['r', 'b']
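# A sketch (assumption): scatter plot of the two principal components, one color per class
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(8, 8))
ax.set_xlabel('Principal component 1')
ax.set_ylabel('Principal component 2')
for target, color in zip(targets, colors):
    rows = PCA_df['class'] == target
    ax.scatter(PCA_df.loc[rows, 'principal component 1'],
               PCA_df.loc[rows, 'principal component 2'], c=color, s=20)
ax.legend(targets)
plt.show()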
Output:
[Figure: scatter plot of principal component 1 versus principal component 2, with the two mushroom classes shown in red and blue.]
The metric to choose the number of components is called the explained variance ratio, which is the percentage of variance
that is contributed by each of the selected components. Ideally, we need to reach a total of 80% to avoid overfitting.
pca.explained_variance_ratio_
The output of the line of code above shows that our first principal component contains 18.5% of the variance and the
second principal component contains 12.4%. Together, the two components contain 30.9% of the information, which is
not sufficient. More components need to be taken.
Although technically we can use PCA on label-encoded data or binary data, it does not perform well and can produce
poor results; PCA is adapted to continuous variables. Squared deviation (minimize variance) is not significant on
categorical data. Alternative methods should be used.
ICA can be used, as described above, to separate useful signals from unhelpful ones. Many biological artifacts reflect
non-Gaussian processes. We can also consider use cases such as improving Siri or Alexa, or isolating one sound from others.
Let us generate some data as sinusoidal (s1), square (s2), or sawtooth (s3) signals. Then, we can standardize and mix the
data with a mixing matrix.
Input:
import numpy as np
import matplotlib.pyplot as plt
from scipy import signal
from sklearn import preprocessing
# Time vector (assumption: the original definition is not shown)
time = np.linspace(0, 8, 2000)
# Sinusoidal signal
s1 = np.sin(2 * time)
# Square signal
s2 = np.sign(np.sin(2 * time))
# Sawtooth signal
s3 = signal.sawtooth(2 * np.pi * time)
# Stack and standardize the sources, then mix them with a mixing matrix
# (assumption: the exact mixing matrix used here is not shown)
S = preprocessing.scale(np.c_[s1, s2, s3])
A = np.array([[1.0, 1.0, 1.0], [0.5, 2.0, 1.0], [1.5, 1.0, 2.0]])
X = np.dot(S, A.T)  # mixed observations
plt.figure(figsize=(10, 10))
models = [S, X]
names = ['Original Data',
         'Mixed Data']
colors = ['red', 'green', 'blue']
# Plot each set of signals in its own subplot
for ii, (model, name) in enumerate(zip(models, names), 1):
    plt.subplot(len(models), 1, ii)
    plt.title(name)
    for sig, color in zip(model.T, colors):
        plt.plot(sig, color=color)
plt.tight_layout()
plt.show()
Output:
[Figure: the three original source signals and the three mixed signals.]
# Estimate the sources with PCA (data and imports from the previous example)
from sklearn.decomposition import PCA
pca = PCA(n_components=3)
P = pca.fit_transform(X)  # estimate PCA sources
plt.figure(figsize=(10, 10))
models = [S, X, P]
names = ['Original Data',
         'Mixed Data',
         'PCA estimation']
colors = ['red', 'green', 'blue']
plt.tight_layout()
Output:
[Figure: the original signals, the mixed signals, and the PCA estimate of the sources.]
As we can see, PCA does not perform well in isolating signals because signals are non-Gaussian processes. If the data only
reflects Gaussian processes, ICA and PCA are equivalent. Let us see now how ICA behaves. One of the simplest ways to code
ICA is to use FastICA from sklearn.decomposition.
Input:
# compute ICA
ica = FastICA(n_components=3)
I = ica.fit_transform(X) # Get the estimated sources
# compute PCA
pca = PCA(n_components=3)
P = pca.fit_transform(X) # estimate PCA sources
plt.figure(figsize=(10, 10))
models = [S, X, P, I]
names = ['Original Data',
'Mixed Data',
'PCA estimation',
106 2 Feature Engineering Techniques in Machine Learning
'ICA estimation']
colors = ['red', 'green', 'blue']
plt.tight_layout()
Output:
Original data
2
–1
–2
–5
As we can see, the signals are easier to isolate. We can also apply ICA after applying PCA to the data.
2.5 Feature Extraction and Selection 107
Input:
import numpy as np
import matplotlib.pyplot as plt
from scipy import signal
from sklearn import preprocessing
from sklearn.decomposition import FastICA, PCA
# Sinusoidal signal
s1 = np.sin(2 * time)
# Square signal
s2 = np.sign(np.sin(2 * time))
# Sawtooth signal
s3 = signal.sawtooth(2 * np.pi * time)
# compute PCA
pca = PCA(n_components=3)
P = pca.fit_transform(X) # estimate PCA sources
plt.figure(figsize=(10, 10))
models = [S, X, I]
names = ['Original Data',
'Mixed Data',
'ICA estimation after PCA']
colors = ['red', 'green', 'blue']
plt.tight_layout()
108 2 Feature Engineering Techniques in Machine Learning
Output:
Original data
2
–1
–2
–5
0.04
0.02
0.00
–0.02
–0.04
–0.06
0 500 1000 1500 2000 2500 3000
If we analyze the mushroom data by applying ICA with two components, we can see the difference from PCA.
Input:
import pandas as pd
csv_data = './data/mushrooms.csv'
df = pd.read_csv(csv_data, delimiter=',')
ica = FastICA(n_components=2)
independent_components = ica.fit_transform(X)
ICA_df = pd.DataFrame(data = independent_components, columns = ['ICA 1', 'ICA 2'])
targets = [1, 0]
colors = ['r', 'b']
Output:
0.01
ICA 2
0.00
–0.01
–0.02
X=
x N,1 x N,M
The first step is to compute the mean of each class μi(1 × M), then compute the total mean of all data μ(1 × M), and then
calculate the between-class matrix SB (M × M) as follows:
c
T
SB = ni μi − μ μi − μ
i=1
W T SB W
arg max
W W T SW W
It can be reformulated as follows:
SW W = λSB W
where λ represents the eigenvalues of W. The solution can be resolved by calculating the eigenvalues and eigenvectors of
the following expression:
−1
W = SW SB
if SW is non-singular.
2.5 Feature Extraction and Selection 111
We can then sort the eigenvectors with the k highest eigenvalues that will be used to construct a lower dimensional space
(Vk). The other eigenvectors (Vk+1, …VM) are neglected.
Each sample, xi in the M-dimensional space can be represented in a k-dimensional space by projecting them onto the
lower dimensional space of LDA as follows: yi = xi Vk.
Let us apply it to a concrete example with the well-known “Iris” dataset deposited on the UCI machine learning repository
(https://archive.ics.uci.edu/ml/datasets/iris).
The dataset contains three classes of 50 instances each and four features (sepal length, sepal width, petal length, petal
width). Each class refers to a type of iris plant (setosa, versicolour, virginica).
Input:
Output:
The “class” column represents y, with 0 being setosa, 1 being versicolor, and 2 being virginica (we can print target_names).
If we translate the DataFrame above into matrices, we can write the following:
x 1sepal length x 1sepal width x 1petal length x 1petal width
x 2sepal length x 2sepal width x 2petal length x 2petal width
X= ,
wsetosa 0
wsetosa 0
y= =
wvirginica 2
If we follow the steps above, we can first start to compute the mean vectors mi of the three classes (setosa, versicolor,
virginica):
where
n
T
Si = x − mi x − mi
x Di
and
n
1
mi = xk
ni x Di
where m is the overall mean, Ni is the sample size of the respective classes, and mi is the sample mean of the respective
classes.
−1
As described above, we will need to solve the generalized eigenvalue problem for the matrix W = SW SB . We can use
NumPy or LAPACK for this purpose. Next, we will sort the eigenvectors by decreasing eigenvalues and choose the k eigen-
vectors with the largest eigenvalues for the new feature subspace. We could do all these steps manually and can find many
examples on the internet if needed, but the simplest way is to use libraries that automatically perform the process for us
such as scikit-learn:
lda = LinearDiscriminantAnalysis(n_components=2)
X_r2 = lda.fit(X, y).transform(X)
To compare with PCA, let us plot both PCA and LDA transformations.
2.5 Feature Extraction and Selection 113
Input:
plt.figure()
colors = ["darkblue", "darkviolet", "darkturquoise"]
linewidth = 2
plt.figure()
for color, i, target_name in zip(colors, [0, 1, 2], target_names):
plt.scatter(
X_lda[y == i, 0], X_lda[y == i, 1], alpha=0.8, color=color, label=target_name
)
plt.legend(loc="best", shadow=False, scatterpoints=1)
plt.title("LDA of IRIS dataset")
plt.show()
114 2 Feature Engineering Techniques in Machine Learning
Output:
1.0
0.5
0.0
–0.5
setosa
–1.0 versicolor
virginica
–3 –2 –1 0 1 2 3 4
–1
–2
As we can see, both approaches can clearly separate the data. If we apply the same approach to another classic dataset
(the wine dataset), we will see that LDA has a serious advantage. Let us only load the wine data, split it into X and y, and
apply the same code as above.
Input:
Output:
20
–20
–2
–4 class_0
class_1
–6 class_2
–6 –4 –2 0 2 4 6
A. isomap
B. LLE
Input manifold
C. L-isomap
Figure 2.12 Comparison of several manifold learners on a Swiss Roll manifold. Color is used to indicate how points in the results
correspond to points on the manifold. The Swiss Roll manifold is a typical example in which we are provided as input some data in a
three-dimensional space with a distribution resembling one of a roll that we can unroll to reduce our data into two-dimensional space.
Source: Gashler et al. (2008).
LLE leverages the local symmetries of linear reconstruction to discover the structure of nonlinear manifolds in high-
dimensional data. To compute LLE and map high-dimensional data points, X i , to low-dimensional embedding vectors,
Y i, we need to follow a few steps. First, we need to compute the neighbors of each data point X i and compute the weights
Wij that best reconstruct each data point X i from its neighbors, minimizing the cost of the following equation by constrained
linear fits:
2
εW = Xi − W ij X j
i j
where Wij is the weight that describes the contribution of the jth data point to the ith reconstruction. In the simplest for-
mulation, we identify K-nearest neighbors per data point, as measured by Euclidean distance. We can also use more sophis-
ticated methodologies such as selecting points with a ball of fixed radius to select neighbors. The equation above sums the
squared distances between all the data points and their reconstruction. To compute Wij, we minimize the cost function using
two constraints:
•• Each data point X i is reconstructed only from its neighbors, meaning that if X j does not belong to this set, Wij = 0.
j W ij = 1 (the row of the weight matrix sum).
The last step is to compute the vectors Y i best reconstructed by the weights Wij, minimizing the quadratic form by its
bottom nonzero eigenvectors in the following equation:
2
ϕY = Yi − W ij Y j
i j
where the low-dimensional vector Y i represents the global internal coordinates on the manifold. The equation can be mini-
mized by solving a sparse N × N eigenvector problem, whose bottom d (we minimize the embedding cost function by choos-
ing d-dimensional coordinates of Y i) nonzero eigenvectors provide an ordered set of orthogonal coordinates centered on the
origin.
To apply LLE and its variants, we will use the digits dataset, in which each data point is an 8 × 8 image of a numerical digit
(1797 samples, 64 features, 180 samples per class). Let us first represent an extract of the digits dataset.
2.5 Feature Extraction and Selection 117
Input:
digits = load_digits()
X = digits.data
y = digits.target
n_samples = 1797
n_features = 64
n_neighbors = 30
Output:
We then define a plot_embedding_function to plot LLE, LLE variants, or other embedding techniques we will see in this
chapter. This function will allow us to display and control whether the digits are grouped together or scattered in the embed-
ding space.
118 2 Feature Engineering Techniques in Machine Learning
Input:
import numpy as np
from matplotlib import offsetbox
from sklearn.preprocessing import MinMaxScaler
ax.set_title(title)
ax.axis("off")
embeddings = {
"Standard LLE": LocallyLinearEmbedding(
n_neighbors=n_neighbors, n_components=2, method="standard"
),
}
2.5 Feature Extraction and Selection 119
Once we have declared the LLE method, we can perform the projection of the original data. In addition, in the example
provided by scikit-learn (manifold learning on handwritten digits) to illustrate various embedding techniques on the digits
dataset, the computational time to perform each projection can be stored.
Input:
print(f"{name}...")
start_time = time()
projections[name] = transformer.fit_transform(data, y)
timing[name] = time() - start_time
Output:
Standard LLE (time 0.641s)
As we can see, data points are quite scattered; this process took 0.641 seconds on my personal computer. If we change the
number of neighbors (n_neighbors), we can see important changes.
120 2 Feature Engineering Techniques in Machine Learning
In the code above, we rescale the data using MinMaxScaler. If we return to n_neighbors = 30 and change the rescaling
method, we also notice some changes.
2.5 Feature Extraction and Selection 121
Output (StandardScaler):
Standard LLE (time 0.661s)
Output (RobustScaler):
Standard LLE (time 0.605s)
In the literature, we can find variants of LLE algorithms such as Hessian locally linear embedding (HLLE), modified
locally linear embedding (MLLE), or local tangent space alignment (LTSA). These variants can solve a well-known issue
with LLE, which is the regularization problem. HLLE achieves linear embedding by minimizing the Hessian function on the
manifold (a Hessian-based quadratic form at each neighborhood used to recover the locally linear structure). The scaling
is not optimal with increased data size but tends to give higher quality results compared to standard LLE. MLLE addresses
the regularization problem by using multiple weight vectors in each neighborhood. In LTSA, PCA is applied on the neigh-
bors to construct a locally linear patch considered to be an approximation of the tangent space at the point. A coordinate
122 2 Feature Engineering Techniques in Machine Learning
representation of the neighbors is provided by the tangent space and provides a low-dimensional representation of
the patch.
Let us project the variants by modifying the code above.
Input:
embeddings = {
"Standard LLE": LocallyLinearEmbedding(
n_neighbors=n_neighbors, n_components=2, method="standard"
),
"Modified LLE": LocallyLinearEmbedding(
n_neighbors=n_neighbors, n_components=2, method="modified"
),
"Hessian LLE": LocallyLinearEmbedding(
n_neighbors=n_neighbors, n_components=2, method="hessian"
),
"LTSA LLE": LocallyLinearEmbedding(
n_neighbors=n_neighbors, n_components=2, method="ltsa"
),
}
Output:
Standard LLE (time 0.640s) Modified LLE (time 1.264s)
The output demonstrates the differences among the variants. HLLE and MLLE provided better results in comparison to
the standard LLE, although the standard LLE took less time to compute.
2.5 Feature Extraction and Selection 123
end
end
Figure 2.13 A simple version of the t-SNE algorithm described by Laurens van der Maaten and Geoffrey Hinton, who introduced t-SNE.
Source: van der Maaten and Hinton (2008).
digits = load_digits()
X = digits.data
y = digits.target
target_digits = digits.target_names
plt.figure(figsize=(8, 8))
colors = ["darkblue", "darkviolet", "darkturquoise", "black", "red", "pink",
"darkseagreen", "cyan", "grey", "darkorange"]
linewidth = 0.5
Output:
30
4
20
2
10
0
–2
0 0
–10 1 1
2 2
3
–4 3
4 4
–20 5 5
6 6
7 –6 7
8 8
9 9
–30
–30 –20 –10 0 10 20 30 –8 –6 –4 –2 0 2 4 6
t-SNE applied to digits dataset
60 0
1
2
3
40 4
5
6
7
8
20 9
–20
–40
–60
–40 –20 0 20 40 60
Input:
random_state=0,
),
"NCA embedding": NeighborhoodComponentsAnalysis(
n_components=2, init="pca", random_state=0
),
}
Output:
Random projection embedding (time 0.003s) Truncated SVD embedding (time 0.006s)
Linear discriminant analysis embedding (time 0.019s) lsomap embedding (time 2.858s)
Standard LLE embedding (time 0.617s) Modified LLE embedding (time 1.162s)
Hessian LLE embedding (time 1.254s) LTSA LLE embedding (time 0.835s)
For practical reasons, we can also simply work with our variables without data visualization, as most of the time we are
manipulating training and test data.
Input:
n_jobs=2,
random_state=0,
)
X_tsne = tsne.fit_transform(X)
• features_extraction: Selects the feature extraction method. The following options are available:
– pca
– ica
– icawithpca
– lda_extraction
– random_projection
– truncatedSVD
– isomap
– standard_lle
– modified_lle
– hessian_lle
– ltsa_lle
– mds
– spectral
– tsne
– nca
•• number_components: The number of principal components we want to keep for PCA, ICA, LDA, or other methods.
n_neighbors: The number of neighbors to consider for manifold learning techniques.
If we go to the hephAIstos main folder and create a new Python file to create a new pipeline, we can copy and paste the
following lines to test the routine:
# Import Data
from data.datasets import neurons
df = neurons()
# Run ML Pipeline
ml_pipeline_function(df, output_folder = './Outputs/', missing_method =
'row_removal', test_size = 0.2, categorical = ['label_encoding'],features_label =
['Target'], rescaling = 'standard_scaler', features_extraction = 'pca',
number_components = 2)
Here, we have decided to create a pipeline and to use row_removal to address missing data, to encode the “Target” feature
with the label encoding method, to rescale the data using StandardScaler, and to use PCA with two components for feature
extraction.
2.5 Feature Extraction and Selection 131
Filter
Set of all Best subset Learning
Performance
features selection algorithm
Wrapper
Subset selection
Embedded
Subset selection
All code examples provided to explain feature selection can be found in hephaistos/Notebooks/Features_selection.ipynb
or https://github.com/xaviervasques/hephaistos/blob/main/Notebooks/Features_selection.ipynb. If we go to hephaistos/
Notebooks in a terminal and open Jupyter Notebook (the command line is jupyter notebook), we can then open
Feature_selection.ipynb in a browser and run the different code examples.
Feature Selection Using Statistical Tests Statistical tests such as z-tests, t-tests, ANOVA, or correlation tests are commonly
used to select features. Before applying statistical tests for feature selection, we need to understand some vocabulary asso-
ciated with them. In statistics, hypothesis testing refers to the act of testing an assumption regarding a population feature by
using sample data. In this case, we test the null hypothesis and the alternative hypothesis. In the null hypothesis, denoted
H0, we hypothesize that there is no significant difference between sample and population or among different populations
(for example, the mean of two samples is equal). In the alternative hypothesis, denoted H1, we hypothesize that there is a
significant difference (for example, the mean of the two samples is not equal). In statistics, to decide whether we can reject
the null hypothesis, we need to calculate a test statistic that provides a numerical value. There are two approaches: the
critical value and the p-value. The critical value is a line on a curve splitting it into one or two sections that are the rejection
regions. In other words, if the test statistic falls into one of the sections, we can reject the null hypothesis. On the contrary, if
the test statistic does not fall into those regions (not extreme as the critical value), we cannot reject the null hypothesis. The
critical value is calculated from a defined significance level α and the type of probability distribution of the ideal model.
In Figures 2.15–2.17, we can see a two-sided, a left-tailed, and a right-tailed test, respectively. In the examples shown in the
figures, the idealized model is a normal probability distribution.
H0 μ = μ0 ; H1 μ μ0
H0 μ = μ0 ; H1 μ < μ0
H0 μ = μ0 ; H1 μ > μ0
The probability value (p-value) of the test statistic is compared to the defined significance level (α). Smaller p-values pro-
vide stronger evidence to reject the null hypothesis (p ≤ 0.05 is considered strong evidence). If the p-value is less than or
equal to α, the null hypothesis H0 is rejected. If the p-value is greater than α, H0 is not rejected.
Another term to keep in mind is degrees of freedom, which is the number of independent variables and is used to calculate
the t-statistic and the chi-squared statistic.
2.5 Feature Extraction and Selection 133
1–α
Critical Critical
value value
α α
2 2
1–α
Critical
value
1–α
Critical
value
In summary, there are several operations we need to perform for statistical tests, starting with calculating a statistical
value from a mathematical formula, then calculating the critical value using statistical tables, calculating the p-value,
and checking whether p ≤ 0.05 to accept or reject the null hypothesis.
For the filter methods, a classical way to begin is to consider the type of data we have. For instance, if both our input and
output are categorical data, the most often used technique is the chi-squared test (chi2() in scikit-learn). We can also use
mutual information (information gain). If our input is a continuous variable and our output is a categorical variable,
ANOVA (f_classif() in scikit-learn), correlation coefficient (linear), or Kendall’s rank coefficient (nonlinear) is appropriate.
Pearson’s (r_regression() in scikit-learn) correlation coefficient (linear) and Spearman’s rank coefficient (nonlinear) can be
used for both numerical input and output. SciPy library can also be used to implement all statistics.
To summarize, all these objects return univariate scores and p-values. For regression tasks, we can use f_regression and
mutual_info_regression; for classification tasks, we can use chi2, f_classif, and mutual_info_classif options in scikit-learn.
Chi-Squared TestLet us now detail a feature selection method under the filter category, namely the chi-squared ( χ 2) test,
which is commonly used in statistics to test statistical independence between two or more categorical variables. A and B are
two independent events if:
P AB = p A P B or, equivalently P A B = P A and P B A = P B
To correctly apply a chi-squared test, the data must be non-negative, sampled independently, and categorical, such as
Booleans or frequencies (greater than 5, as chi-square is sensitive to small frequencies). The test is not appropriate for
continuous variables or for comparison of categorical and continuous variables. In statistics, if we want to test the cor-
relation (dependence) of two continuous variables, Pearson correlation is commonly used. When one variable is con-
tinuous and the other one is ordinal, Spearman’s rank correlation is appropriate. It is possible to apply the chi-squared
test on continuous variables after binning the variables. The idea behind the chi-squared test is to calculate a p-value in
order to identify a high correlation (low p-value) between two categorical variables, indicating that the variables are
dependent on each other (p < 0.05). The p-value is calculated from the chi-squared score and degrees of freedom.
In feature selection, we calculate chi-squared between each feature and the target. We can select the desired number of
features based on the best chi-squared scores.
The test statistic for the chi-squared test is:
2
Oi − E i
χ 2c =
Ei
where O is the observed values, E is the expected values, and the degrees of freedom c = k − r, where k is the number of
groups for which observed and expected frequencies are available and r is the number of restrictions or constraints imposed
on the given comparison.
Let us be more concrete and calculate chi-squared in a concrete and simple example. Let us assume we have 3145 voters
and that we want to assess whether gender influences the choice of a politician:
Applying this calculation to all the data gives the following expected values:
2
Oi − E i
Now, we can apply the general formula χ 2c = :
Ei
χ 2c = 41 1652822
As described above, we need to calculate the degrees of freedom. In our case, c = (number of columns − 1) × (number of
rows − 1) = (3 − 1) × (2 −1) = 2.
We can now check in the chi-squared distribution table (Figure 2.18). By comparing our chi-squared statistic (41.165) to
the one in the table for an alpha level of 0.05 and two degrees of freedom (5.991), we can see that our obtained statistic is
higher than the one in the table (critical value), meaning that we can reject the null hypothesis (H0) and conclude that some
evidence exists for an association between gender and choice of politician.
To implement chi-squared test for feature selection, let us use the mushroom dataset.
Input:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
csv_data = '../data/datasets/mushrooms.csv'
df = pd.read_csv(csv_data, delimiter=',')
.95
0 31.410 χ2
20
p(χ2 ≤ 31.410) = .95
20
Figure 2.18 Percentiles of the chi-squared distribution. Source: Daniel and Cross (2018).
Output:
(5686, 22)
As we can see by printing the shape of the data, we have 22 features. We can perform a chi-squared test on the samples to
retrieve only the two best features using SelectKBest, which will remove all but the two highest-scoring features.
2.5 Feature Extraction and Selection 137
Input:
# Perform a chi-square test to the samples to retrieve only the two best features
Output:
(5686, 2)
It is also possible to use SelectPercentile to select features according to a percentile of the highest scores:
ANOVA F-Value As we have seen, if the features are categorical, we can choose to examine chi-squared statistics between the
features and the target vector. If the features are continuous variables and the target vector is categorical, the analysis of
variance (ANOVA) F-statistic can be calculated to determine whether the means of each group (features by the target vector)
are significantly different. When we run an ANOVA test or a regression analysis, we can compute an F-statistic (F-test) to
statistically assess the equality of means. An F-test is similar to a t-test, which determines whether a single feature is sta-
tistically different; the F-test will determine whether the means of three or more groups are different. In fact, when we apply
the ANOVA F-test to only two groups, F = t2 where t is the student’s t-statistic. Similar to other statistical tests, we will obtain
an F-value, a critical F-value, and a p-value. For the one-way ANOVA F-test statistic, the formula is the following:
K
1 2
ni X i − X
K −1i=1
F= K ni
1 2
X ij − X i
N −K i=1j=1
The numerator is the explained variance or between-group variability, and the denominator is the unexplained variance
or within-group variability. X i is the sample mean in the ith group, X is the overall mean of our data, K is the number of
groups, ni is the number of observations in the ith group, Xij is the jth observation in the ith out of K groups, and N is the
overall sample size. The test statistic is compared to a quantile in the F-distribution. In the F-test, the data values need to be
independent and normally distributed with a common variance.
In linear regression, the F-test can be used to determine whether we are able to improve our linear regression model by
making it more complex (adding more linear regression variables) or if it would be better to swap our complex model with
an intercept-only model (simple linear regression model).
The unrestricted model is the following:
β1 x 1 + β2 x 2 + β3 x 3 + β4 x 4 + β 5 x 5 + β0 = y
The restricted model is the following:
β1 x 1 + β2 x 2 + β3 x 3 + β0 = y
If we consider two regression models (model 1 and model 2) for which model 1 has k1 parameters and model 2 has k2
parameters with k1 < k2, then model 1 (restricted model) is the simpler version of model 2 (unrestricted model).
We can then calculate the F-statistic as follows:
RSS1 − RSS2
k2 − k1
F=
RSS2
n − k2
where RSS1 is the residual sum of squares of fitted model 1, RSS2 is the residual sum of squares of fitted model 2, and n is the
number of data samples.
To use an ANOVA F-test to select features, let us use a well-known binary classification dataset that is a copy of the UCI
ML Breast Cancer Wisconsin (Diagnostic) dataset. It is composed of two classes (WDBC-Malignant, WDBC-Benign) and
30 numeric attributes, with 569 samples (212 for malignant and 357 for benign). The features have been computed from
a digitized image of a fine needle aspirate (FNA) of a breast mass and describe characteristics of the cell nuclei present
in the image.
Input:
Output:
(398, 30)
Let us now apply some Python code to select features according to the two best ANOVA F-values.
Input:
# Create an SelectKBest object to select features with two best ANOVA F-Values
fvalue_selector = SelectKBest(f_classif, k=2)
# Apply the SelectKBest object to the training features (X_train) and target (y_train)
X_train_f_classif = fvalue_selector.fit_transform(X_train, y_train)
print(X_train_f_classif.shape)
2.5 Feature Extraction and Selection 139
Output:
(398, 2)
For the regression task, we can use a medical cost personal dataset (https://www.kaggle.com/mirichoi0218/insurance),
which can be used for insurance forecasts by linear regression. In the dataset, we will find costs billed by health insurance
companies (insurance charges) and features (age, gender, BMI, children, smoking status).
Let us examine the head of the data:
age sex bmi children smoker region charges
Input:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
csv_data = '../data/datasets/insurance.csv'
df = pd.read_csv(csv_data, delimiter=',')
# Concatenate dataframes
df = pd.concat([df, df_encoded], axis=1)
print(X.shape)
print(X_train.shape)
# Create an SelectKBest object to select features with two best ANOVA F-Values
f_value = SelectKBest(f_regression, k=2)
# Apply the SelectKBest object to the training features (X_train) and target (y_train)
X_train_f_regression = f_value.fit_transform(X_train, y_train)
print(X_train_f_regression.shape)
Output:
(1338, 6)
(936, 6)
(936, 2)
Pearson Correlation Coefficient We can use the Pearson correlation coefficient (r) to measure the linear relationship between
two or more variables or, in other words, how much we can predict one variable from another. We can use this number for
feature selection with the idea that the variables to keep are those that are highly correlated with the target and uncorrelated
among themselves. The Pearson correlation coefficient is a number between −1 and 1. A value close to 1 indicates a strong
positive correlation (if r = 1, there is a perfect linear correlation between two variables). A value close to −1 means a strong
negative correlation (r = −1 indicates a perfect inverse linear correlation). Values close to 0 indicate weak correlation
(0 means no linear correlation at all) (Figure 2.19).
To compute the linear coefficient of correlation r, the following formula can be used:
Xi − X Yi − Y
r=
Y 2 2
Xi − X Yi − Y
where r is the Pearson correlation coefficient, Xi and Yi are the X and Y variable
samples, respectively, and X and Y are the means of the values of the X and Y
variables, respectively.
In scikit-learn, we can use r_regression. The difference from f_regression, which
is derived from r_regression, is that f_regression produces values in the range [0, 1]
rather than [−1, 1] and also provides p-values, in contrast to r_regression. If all the
features are positively correlated with the target, f_regression and r_regression will
rank features in the same order.
X To implement feature selection based on the coefficient of correlation r, we will
Figure 2.19 Perfect inverse linear use the Breast Cancer Dataset from the UCI ML Breast Cancer Wisconsin
correlation (r = –1). (Diagnostic).
2.5 Feature Extraction and Selection 141
Input:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
breastcancer = '../data/breastcancer.csv'
df = pd.read_csv(breastcancer, delimiter=';')
# Encode the data and drop original column from df + remove id variable
enc = LabelEncoder()
df_encoded = df[['diagnosis']].apply(enc.fit_transform)
df = df.drop(['diagnosis',"id"], axis = 1)
# Concatenate dataframes
df = pd.concat([df, df_encoded], axis=1)
Output:
142 2 Feature Engineering Techniques in Machine Learning
Output:
Correlation heatmap of breast cancer dataset
1.0
radius_mean
texture_mean
perimeter_mean
area_mean
smoothness_mean 0.8
compactness_mean
concavity_mean
concave points_mean
symmetry_mean
0.6
fractal_dimension_mean
radius_se
texture_se
perimeter_se
area_se 0.4
smoothness_se
compactness_se
concavity_se
concave points_se
symmetry_se 0.2
fractal_dimension_se
radius_worst
texture_worst
perimeter_worst
0.0
area_worst
smoothness_worst
compactness_worst
concavity_worst
concave points_worst –0.2
symmetry_worst
fractal_dimension_worst
concavity_mean
concave points_mean
radius_se
compactness_se
concavity_se
concave points_se
radius_worst
compactness_worst
concavity_worst
concave points_worst
texture_mean
perimeter_mean
area_mean
smoothness_mean
symmetry_mean
fractal_dimension_mean
perimeter_se
area_se
smoothness_se
fractal_dimension_se
perimeter_worst
area_worst
smoothness_worst
fractal_dimension_worst
radius_mean
compactness_mean
texture_se
symmetry_se
texture_worst
symmetry_worst
As stated above, the main idea in feature selection is to retain the variables that are highly correlated with the target
(“diagnosis” in our case) and keep features that are uncorrelated among themselves. For instance, we can search for the
index of feature columns with correlation greater than 0.8.
Input:
import numpy as np
# Select upper triangle of correlation matrix
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(np.bool))
# Find index of feature columns with correlation greater than 0.8
to_drop = [column for column in upper.columns if any(upper[column] > 0.8)]
print(to_drop)
Output:
['perimeter_mean', 'area_mean', 'concavity_mean', 'concave points_mean',
'perimeter_se', 'area_se', 'concavity_se', 'fractal_dimension_se', 'radius_worst',
'texture_worst', 'perimeter_worst', 'area_worst', 'smoothness_worst',
'compactness_worst', 'concavity_worst', 'concave points_worst',
'fractal_dimension_worst']
Now that we have identified the features to drop, we can drop them from the original dataset.
144 2 Feature Engineering Techniques in Machine Learning
Input:
From the new generated dataset (df_new), we now have to set an absolute value for the threshold to select features that are
correlated with the target. Let us choose 0.5.
Input:
Output:
Correlation heatmap of breast cancer new dataset
1.0
radius_mean 1.0 0.3 0.2 0.5 0.1 –0.3 0.7 –0.1 –0.2 0.2 0.4 –0.1 0.2
texture_mean 0.3 1.0 –0.0 0.2 0.1 –0.1 0.3 0.4 0.0 0.2 0.2 0.0 0.1
0.8
smoothness_mean 0.2 –0.0 1.0 0.7 0.6 0.6 0.3 0.1 0.3 0.3 0.4 0.2 0.4
compactness_mean 0.5 0.2 0.7 1.0 0.6 0.6 0.5 0.0 0.1 0.7 0.6 0.2 0.5
0.6
symmetry_mean 0.1 0.1 0.6 0.6 1.0 0.5 0.3 0.1 0.2 0.4 0.4 0.4 0.7
fractal_dimension_mean –0.3 –0.1 0.6 0.6 0.5 1.0 0.0 0.2 0.4 0.6 0.3 0.3 0.3
0.4
radius_se 0.7 0.3 0.3 0.5 0.3 0.0 1.0 0.2 0.2 0.4 0.5 0.2 0.1
texture_se –0.1 0.4 0.1 0.0 0.1 0.2 0.2 1.0 0.4 0.2 0.2 0.4 –0.1
0.2
smoothness_se –0.2 0.0 0.3 0.1 0.2 0.4 0.2 0.4 1.0 0.3 0.3 0.4 –0.1
compactness_se 0.2 0.2 0.3 0.7 0.4 0.6 0.4 0.2 0.3 1.0 0.7 0.4 0.3
0.0
concave points_se 0.4 0.2 0.4 0.6 0.4 0.3 0.5 0.2 0.3 0.7 1.0 0.3 0.1
symmetry_se –0.1 0.0 0.2 0.2 0.4 0.3 0.2 0.4 0.4 0.4 0.3 1.0 0.4
–0.2
symmetry_worst 0.2 0.1 0.4 0.5 0.7 0.3 0.1 –0.1 –0.1 0.3 0.1 0.4 1.0
radius_mean
texture_mean
smoothness_mean
compactness_mean
symmetry_mean
fractal_dimension_mean
radius_se
texture_se
smoothness_se
compactness_se
concave points_se
symmetry_se
symmetry_worst
2.5 Feature Extraction and Selection 145
Input:
Output:
Correlation heatmap of breast cancer new dataset
1.0
radius_mean 1.0 0.3 0.2 0.5 0.1 –0.3 0.7 –0.1 –0.2 0.2 0.4 –0.1 0.2 0.7
texture_mean 0.3 1.0 –0.0 0.2 0.1 –0.1 0.3 0.4 0.0 0.2 0.2 0.0 0.1 0.4
0.8
smoothness_mean 0.2 –0.0 1.0 0.7 0.6 0.6 0.3 0.1 0.3 0.3 0.4 0.2 0.4 0.4
compactness_mean 0.5 0.2 0.7 1.0 0.6 0.6 0.5 0.0 0.1 0.7 0.6 0.2 0.5 0.6
0.6
symmetry_mean 0.1 0.1 0.6 0.6 1.0 0.5 0.3 0.1 0.2 0.4 0.4 0.4 0.7 0.3
fractal_dimension_mean –0.3 –0.1 0.6 0.6 0.5 1.0 0.0 0.2 0.4 0.6 0.3 0.3 0.3 –0.0
radius_se 0.7 0.3 0.3 0.5 0.3 0.0 1.0 0.2 0.2 0.4 0.5 0.2 0.1 0.6 0.4
texture_se –0.1 0.4 0.1 0.0 0.1 0.2 0.2 1.0 0.4 0.2 0.2 0.4 –0.1 –0.0
smoothness_se –0.2 0.0 0.3 0.1 0.2 0.4 0.2 0.4 1.0 0.3 0.3 0.4 –0.1 –0.1 0.2
compactness_se 0.2 0.2 0.3 0.7 0.4 0.6 0.4 0.2 0.3 1.0 0.7 0.4 0.3 0.3
concave points_se 0.4 0.2 0.4 0.6 0.4 0.3 0.5 0.2 0.3 0.7 1.0 0.3 0.1 0.4
0.0
symmetry_se –0.1 0.0 0.2 0.2 0.4 0.3 0.2 0.4 0.4 0.4 0.3 1.0 0.4 –0.0
symmetry_worst 0.2 0.1 0.4 0.5 0.7 0.3 0.1 –0.1 –0.1 0.3 0.1 0.4 1.0 0.4
–0.2
diagnosis 0.7 0.4 0.4 0.6 0.3 –0.0 0.6 –0.0 –0.1 0.3 0.4 –0.0 0.4 1.0
radius_mean
texture_mean
smoothness_mean
compactness_mean
symmetry_mean
fractal_dimension_mean
radius_se
texture_se
smoothness_se
compactness_se
concave points_se
symmetry_se
symmetry_worst
diagnosis
146 2 Feature Engineering Techniques in Machine Learning
Features to retain:
radius_mean 0.730029
compactness_mean 0.596534
radius_se 0.567134
diagnosis 1.000000
As we can see from using the correlation coefficient r, the features to retain in our future model are radius_mean, com-
pactness_mean, and radius_se.
We can also compute Pearson’s r for each feature and the target using r_regression from scikit-learn.
Filter Methods: Many More Possibilities As we have seen, the filter methodologies are based on univariate metrics that allow
us to select features based on a ranking such as variance, chi-squared, correlation coefficient, or information gain (mutual
information). We can also use other methods not described above such as missing value ratio in which we compute the
number of missing values in each feature divided by the total number of observations. After defining a threshold, we
can eliminate a column or not. Instead of using the variance, we can remove the square and alternatively calculate the mean
absolute difference. Another possibility is to assess the dispersion ratio between the arithmetic mean and the geometric
mean for a specific feature and retain features with higher dispersion ratios or to determine whether two variables are mutu-
ally dependent by calculating the amount of information that a feature contributes to making the prediction for the other
feature such as the target.
There are many ways to select features with filter methods. The advantage of filter methods is that they are model-agnostic
and fast to compute. The disadvantage is that they examine individual features only, which can be an issue. A feature may
not be useful on its own but it may have much more influence on the target if combined with others.
For classification, we can compute the area under the receiver operating characteristic curve (ROC AUC) from prediction
scores and use it to check or visualize the performance of a classification problem. To summarize, the method indicates how
well a model can distinguish between the different classes, with ROC being a probability curve and AUC a measure of sep-
arability (the higher the AUC, the better the model). We could also choose accuracy, precision, recall, f1-score, or another
metric. For regression, we can compute R-squared, which indicates the strength of the relationship between our model and
the dependent variable.
To implement wrapper methods and perform feature selection, we will analyze a dataset used to recognize fraudulent
credit card transactions (https://www.kaggle.com/mlg-ulb/creditcardfraud). The dataset contains transactions made by
credit cards in September 2013 by European cardholders over two days, in which we can find 492 frauds (0.172%) out
of 284,807 transactions. The data have been transformed using PCA. The principal components (V1, V2, V3, …) are the
new features except for “Time” and “Amount,” which are the original ones. The feature “Class” is the response variable
with a value of 1 in case of fraud and 0 otherwise.
Input:
import pandas as pd
csv_data = '../data/datasets/creditcard.csv'
df = pd.read_csv(csv_data, delimiter=',')
df.head()
Output:
Time V1 V2 V3 V4 V5 V6 V7 V8 V9 ... V21 V22 V23 V24 V25 V26 V27 V28 Amount Class
0 0.0 –1.359807 –0.072781 2.536347 1.378155 –0.338321 0.462388 0.239599 0.098698 0.363787 ... –0.018307 0.277838 –0.110474 0.066928 0.128539 –0.189115 0.133558 –0.021053 149.62 0
1 0.0 1.191857 0.266151 0.166480 0.448154 0.060018 –0.082361 –0.078803 0.085102 –0.255425 ... –0.225775 –0.638672 0.101288 –0.339848 0.167170 –0.125895 –0.008983 0.014724 2.69 0
2 1.0 –1.358354 –1.340163 1.773209 0.379780 –0.503198 1.800499 –0.791461 0.247676 –1.514654 ... 0.247998 0.771679 0.909412 –0.689281 –0.327642 –0.139097 –0.055353 –0.059752 378.66 0
3 1.0 –0.966272 –0.185226 1.792993 –0.863291 –0.010309 1.247203 0.237609 0.377436 –1.387024 ... –0.108300 0.005274 –0.190321 –1.175575 0.647376 –0.221929 0.062723 0.061458 123.50 0
4 2.0 –1.158233 0.877737 1.548718 0.403034 –0.407193 0.095921 0.592941 –0.270533 –0.817739 ... –0.009431 0.798278 –0.137458 0.141267 –0.206010 0.502292 0.219422 0.215153 69.99 0
5 rows × 31 columns
To implement forward selection, we can use the mlxtend library, which contains most of the feature selection techniques
based on wrapper methods. Of note, the stopping criteria in mlxtend implementation are an arbitrarily set number of
features.
Input:
forward=True,
floating=False,
scoring = 'r2',
cv = 0)
# Print selected features (we could replace the two following lines with print(sfs.
k_feature_names_)
selected_feat= X.columns[list(sfs.k_feature_idx_)]
print(selected_feat)
Output:
Index(['V10', 'V12', 'V14', 'V16', 'V17'], dtype='object')
In the sequential forward selection (SFS) section of the code above, we have set some parameters. First, we have chosen
the LinearRegression() estimator for the process. We could choose any scikit-learn classifier or regressor. We have also cho-
sen to select the “best” five features from the dataset. This number can be any that we select, but we can also assess the
optimal value by analyzing the scores for different numbers. Here, we set forward as true and floating as false for the forward
selection technique. The evaluation criterion is provided by the parameter scoring. Here, we have chosen R-squared; we
have not chosen k-fold cross-validation. We can read the selected feature indices as follows:
Input:
sfs.subsets_
Output:
{1: {'feature_idx': (17,),
'cv_scores': array([0.10679159]),
'avg_score': 0.10679159377776526,
'feature_names': ('V17',)},
2: {'feature_idx': (14, 17),
'cv_scores': array([0.20079793]),
'avg_score': 0.20079793094854692,
'feature_names': ('V14', 'V17')},
3: {'feature_idx': (12, 14, 17),
'cv_scores': array([0.27027216]),
'avg_score': 0.27027215811858596,
'feature_names': ('V12', 'V14', 'V17')},
4: {'feature_idx': (10, 12, 14, 17),
'cv_scores': array([0.31588019]),
'avg_score': 0.31588019129581957,
'feature_names': ('V10', 'V12', 'V14', 'V17')},
5: {'feature_idx': (10, 12, 14, 16, 17),
'cv_scores': array([0.35472495]),
'avg_score': 0.35472495313955954,
'feature_names': ('V10', 'V12', 'V14', 'V16', 'V17')}}
The prediction score for our five features is computed as follows:
2.5 Feature Extraction and Selection 149
Input:
sfs.k_score_
Output:
0.35472495313955954
Let us now change some parameters. For example, we can use KNN instead of linear regression and use cross-validation.
Input:
# Print selected features (we could replace the two following lines with print(sfs.
k_feature_names_)
selected_feat= X.columns[list(sfs.k_feature_idx_)]
print(selected_feat)
Output:
With this configuration, the process takes more time to compute when run on a personal computer. We have added the
n_jobs = −1 option in SFS to run the cross-validation on all our available CPU cores.
Let us read the selected feature indices:
150 2 Feature Engineering Techniques in Machine Learning
Input:
sfs.subsets_
Output:
Let us also compute the prediction score for our five features:
Input:
sfs.k_score_
Output:
0.9994733261647173
pd.DataFrame.from_dict(sfs.get_metric_dict()).T
Output:
feature_idx cv_scores avg_score feature_names ci_bound std_dev std_err
2 (12, 17) [0.9992758234764862, 0.9993855471921701, 0.999... 0.999333 (V12, V17) 0.000058 0.000045 0.000023
3 (4, 12, 17) [0.9993416577058966, 0.9993636024490333, 0.999... 0.999412 (V4, V12, V17) 0.00007 0.000054 0.000027
4 (3, 4, 12, 17) [0.9994513814215804, 0.9994294366784436, 0.999... 0.999451 (V3, V4, V12, V17) 0.000069 0.000054 0.000027
5 (3, 4, 12, 14, 17) [0.9994294366784436, 0.9994294366784436, 0.999... 0.999473 (V3, V4, V12, V14, V17) 0.000059 0.000046 0.000023
2.5 Feature Extraction and Selection 151
Backward Elimination In backward elimination, contrary to forward stepwise, we start with the full model (all features
including the independent ones). The objective is to eliminate the least significant feature (the worst feature with the highest
p-value > significant level) at each iteration until no improvement of the performance model is observed.
To implement backward stepwise selection, we follow the same procedure as forward stepwise selection with the use of
the mlxtend library.
Input:
import pandas as pd
import numpy as np
Output:
Here, we have chosen the ExtraTreeClassifier() estimator for the process. We have also set forward as false and floating as
false for the backward elimination technique.
We can also plot the performance versus the number of features:
152 2 Feature Engineering Techniques in Machine Learning
Input:
Output:
Backward elimination
0.9995
0.9994
Performance
0.9993
0.9992
0.9991
0.9990
1 2 3 4 5
Number of features
Let us now use RandomForestClassifier as another example. The script below can take quite a bit of time to execute.
Input:
import pandas as pd
import numpy as np
Output:
Index(['V4', 'V7', 'V14', 'V15', 'V17'], dtype='object')
Exhaustive Feature Selection Exhaustive feature selection is a brute-force evaluation of feature subsets that evaluates model
performance (such as classification accuracy) with all feature combinations. For instance, if there are three features, the
model will be tested with feature 0 only, then feature 1 only, feature 2 only, features 0 and 1, features 0 and 2, features
1 and 2, and features 0, 1, and 2. Like the other wrapper methods, the method is computationally expensive (a greedy algo-
rithm) due to its search for all combinations. We can use different approaches, such as reducing the search space, to reduce
this time.
To implement this method, we can also use the ExhaustiveFeatureSelector function from the mlxtend.feature_selection
library. As we can see in the script below, the class has min_features and max_features attributes to specify the minimum
and maximum number of features desired in the combination.
Input:
import pandas as pd
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from mlxtend.feature_selection import ExhaustiveFeatureSelector as EFS
knn = KNeighborsClassifier(n_neighbors=3)
# Call the fit method on our feature selector and pass it the training set
efs1 = efs1.fit(X_train, y_train)
Output:
Least Absolute Shrinkage and Selection Operator (Lasso) Lasso is a shrinkage method. It performs L1 regularization as follows:
N M 2 M
L1 Regularization = yi − x ij wj +α wj
i=0 j=0 j=0
As we can see, L1 regularization adds a penalty to the cost based on the complexity of the model. In the equation above,
instead of calculating the cost with a loss function (the first term of the equation), there is an additional element (the second
term of the equation) called the regularization term that is used to penalize the model. L1 regularization adds the absolute
value of the magnitude of coefficient (the weights w). The hyperparameter α is a complexity parameter that is non-negative
2.5 Feature Extraction and Selection 155
and controls the amount of shrinkage. This is a hyperparameter that we should tune. A larger value produces a greater
amount of shrinkage, resulting in a more simplified model. If α is 0, there is no elimination of the parameters; increasing
α leads to increased bias, while decreasing α will increase the variance.
Let us take the dataset recorded from European cardholders that we have previously used and select features using lasso.
Input:
import pandas as pd
import numpy as np
# Print total features, selected features and features with coefficients shrank to
zero
print('total features: {}'.format((X_train.shape[1])))
print('selected features: {}'.format(len(selected_feat)))
print('features with coefficients shrank to zero: {}'.format(
np.sum(sel_.estimator_.coef_ == 0)))
Output:
Index(['V3', 'V12', 'V14', 'V17'], dtype='object')
total features: 30
selected features: 4
features with coefficients shrank to zero: 24
As can be seen in the script, we have set α to 0.01. It is important to be careful with the α hyperparameter because the
penalty can highly impact the performance of the model. If it is set too high, it can encourage the removal of important
features.
It could also be important to print the weight values. We can do this with eli5.
156 2 Feature Engineering Techniques in Machine Learning
Input:
import pandas as pd
import numpy as np
# Print the results: the greater the weight the more important the feature
import eli5
eli5.show_weights(regressor, top=-1, feature_names = X_train.columns.tolist())
Output:
y top features
Weight? Feature
+0.006 V11
+0.004 V4
+0.003 V2
+0.002 V21
+0.002 V19
+0.001 V8
+0.001 V27
+0.000 V25
+0.000 Amount
+0.000 Time
L2 Regularization (Ridge Regression) L2 regularization (ridge regression) adds a penalty that is equal to the square of the
magnitude of coefficients:
N M 2 M
L2 Regularization = yi − x ij wj +α w2j
i=0 j=0 j=0
2.5 Feature Extraction and Selection 157
Ridge regression and lasso regression were built to make use of regularization for prediction by penalizing the magnitude
of coefficients and minimizing the errors between actual values and predictions. In contrast to lasso regression, ridge regres-
sion cannot nullify the impact of an irrelevant feature, which is an effective way to reduce the variance when we have many
insignificant features. In other words, use of ridge regression cannot reduce the coefficients to absolute zero (it does not
eliminate features but only minimizes them), which means that if we have data with a very large number of features
out of which only few are significant for our model, the model risks having poor accuracy.
Elastic Net Combination of L1 and L2 regularization produces the elastic net method of adding a hyperparameter:
N M 2 M M
Elastic Net = yi − x ij wj +λ α wj + 1 − α w2j
i=0 j=0 j=0 j=0
where the second component is the penalty function of the elastic net regression. If α = 0, we can recognize ridge regression;
if α = 1, we can recognize lasso regression. Similar to L1 and L2 regularization, cross-validation can be used to tune the
hyperparameter α. The elastic net method balances between lasso, which eliminates features and reduces overfitting,
and ridge, which reduces the impact of features that are not significant in predicting target values.
Selecting Features with Regularization Embedded into Machine Learning AlgorithmsIn embedded methods, we can select an algo-
rithm for classification or regression and choose the penalty we wish to apply. Let us say we want to build a model using, for
example, a linear support vector classification algorithm (LinearSVC in scikit-learn) using an L1 penalty (lasso for regres-
sion tasks and LinearSVC for classification).
Input:
import pandas as pd
import numpy as np
Output:
Selected Features:
Index(['V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10', 'V11',
'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20', 'V21',
'V22', 'V23', 'V24', 'V25', 'V27', 'V28', 'Amount'],
dtype='object')
Removed Features:
Index(['V26'], dtype='object')
import pandas as pd
import numpy as np
from matplotlib import pyplot
import pandas as pd
from sklearn.preprocessing import LabelEncoder
breastcancer = '../data/breastcancer.csv'
df = pd.read_csv(breastcancer, delimiter=';')
# Encode the data and drop original column from df + remove id variable
enc = LabelEncoder()
df_encoded = df[['diagnosis']].apply(enc.fit_transform)
df = df.drop(['diagnosis',"id"], axis = 1)
# Concatenate dataframes
df = pd.concat([df, df_encoded], axis=1)
Output:
Features Importances
0 radius_mean –1.064278
1 texture_mean –0.230833
2 perimeter_mean 0.391031
3 area_mean –0.026048
4 smoothness_mean 0.136829
5 compactness_mean 0.239398
6 concavity_mean 0.515873
7 concave points_mean 0.274763
8 symmetry_mean 0.220043
9 fractal_dimension_mean 0.038494
10 radius_se 0.118969
11 texture_se –1.348197
12 perimeter_se –0.457078
13 area_se 0.145482
14 smoothness_se 0.018263
15 campactness_se –0.005684
16 concavity_se 0.067884
17 concave points_se 0.035053
18 symmetry_se 0.045293
19 fractal_dimension_se –0.000522
20 radius_worst –0.042753
21 texture_worst 0.504892
22 perimeter_worst 0.060300
23 area_worst 0.011025
24 smoothness_worst 0.274080
25 compactness_worst 0.718200
26 concavity_worst 1.339619
27 concave points_worst 0.495012
28 symmetry_worst 0.716176
29 fractal_dimension_worst 0.099935
160 2 Feature Engineering Techniques in Machine Learning
1.0
0.5
0.0
–0.5
–1.0
0 5 10 15 20 25 30
For regression problems, we could also use, for example, ordinary least squares linear regression to fit a linear model with
coefficients (w1, w2, …, wn) to minimize the residual sum of squares between the predicted and observed targets. Let us
analyze in the following example the medical cost personal dataset (https://www.kaggle.com/mirichoi0218/insurance),
which is used for insurance forecasts, by using linear regression. In the dataset, we will find costs billed by health insurance
companies (insurance charges) and features (age, gender, BMI, children, smoking status).
Input:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
csv_data = '../data/insurance.csv'
df = pd.read_csv(csv_data, delimiter=',')
# Concatenate dataframes
df = pd.concat([df, df_encoded], axis=1)
Output:
Features Importances
0 age 261.625690
1 bmi 344.544831
2 children 424.370166
3 sex 109.647196
4 smoker 23620.802521
5 region –326.462625
20000
15000
10000
500
0 1 2 3 4 5
Tree-Based Feature Importance Tree-based algorithms such as random forest, XGBoost, decision tree, or extra tree are also
commonly used for prediction. They can also be an alternative method to select features by indicating which of them are
more important or the most used in making predictions for our target variable (classification). For the example of random
forest, a machine learning technique used to solve regression and classification consisting of many decision trees, each tree
of the random forest can calculate the importance of a feature. The random forest algorithm can calculate the importance of
a feature because of its ability to increase the “pureness” of the leaves. In other words, when we train a tree, feature impor-
tance is determined as a decrease in node impurity weighted in a tree (the higher the increment in leaf purity, the more
important the feature). We call a situation “pure” when the elements belong to a single class. After a normalization, the sum
of the calculated importance scores is 1. The mean decrease impurity that we call the Gini index (between 0 and 1), used by
random forest to estimate a feature’s importance, measures the degree or probability that a variable has been wrongly clas-
sified when randomly chosen. The index is 0 when all elements belong to a certain class, 1 when the elements are randomly
distributed across various classes, and 0.5 when the elements are equally distributed among classes.
The Gini index is calculated as follows:
n
2
Gini = 1 − pi
i=1
where pi is the probability that an element has been classified in a distinct class.
162 2 Feature Engineering Techniques in Machine Learning
Let us code a simple example to calculate feature importance using random forest.
Input:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
Output:
Features Importances
0 Time 0.011951
1 V1 0.015791
2 V2 0.011951
3 V3 0.019299
4 V4 0.026344
5 V5 0.012594
6 V6 0.014220
7 V7 0.026682
8 V8 0.012779
9 V9 0.031650
10 V10 0.083980
11 V11 0.065454
12 V12 0.147052
13 V13 0.009819
14 V14 0.114699
15 V15 0.011963
16 V16 0.052669 Features importance
17 V17 0.171931
18 V18 0.029313
0.3
19 V19 0.012876
20 V20 0.014186
0.2
Scores
21 V21 0.015796
22 V22 0.010009
23 V23 0.006625 0.1
24 V24 0.010271
25 V25 0.008583 0.0
26 V26 0.017892
27 V27 0.011208
Time
V1
V2
V3
V4
V5
V6
V7
V8
V9
V10
V11
V12
V13
V14
V15
V16
V17
V18
V19
V20
V21
V22
V23
V24
V25
V26
V27
V28
Amount
28 V28 0.010213
29 Amount 0.012203
Random forest has some limitations. For example, if two features are correlated, they will be given similar and
lowered importance. In addition, as a set of decision trees, it gives preference to features with high cardinality.
As stated above, we can perform feature selection with tree-based algorithms such as decision tree (nonparametric,
supervised) for both classification and regression. For classification, we can use DecisionTreeClassifier from scikit-learn.
Input:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
Output:
Features Importances
0 Time 0.028703
1 V1 0.015377
2 V2 0.000000
3 V3 0.004238
4 V4 0.011141
5 V5 0.001695
6 V6 0.009737
7 V7 0.022080
8 V8 0.010224
9 V9 0.000000
10 V10 0.040428
11 Decision tree feature importance
V11 0.005098
12 V12 0.036830
13 V13 0.015404 0.5
14 V14 0.112317
15 V15 0.007372
16 V16 0.017439 0.4
17 V17 0.536157
18 V18 0.001907
19 V19 0.020087 0.3
20 V20 0.005494
21 V21 0.009470
22 V22 0.008294 0.2
23 V23 0.002796
24 V24 0.014243
25 V25 0.006329 0.1
26 V26 0.022797
27 V27 0.016909
28 V28 0.006952 0.0
29 Amount 0.010481 0 5 10 15 20 25 30
2.5 Feature Extraction and Selection 165
For regression, we simply need to replace DecisionTreeClassifier() with DecisionTreeRegressor(). We can employ many
tree-based algorithms for feature selection using both regression, for example, RandomForestRegressor(), GradientBoostin-
gRegressor(), or ExtraTreesRegressor(), and classification, for example, RandomForestClassifier(), GradientBoostingClas-
sifier(), or ExtraTreesClassifier(). It is important to take time to explore which technique will give a model the best
performance.
Permutation Feature Importance The idea of permutation feature importance was introduced by Breiman (2001) for random
forests. It measures the importance of a feature by computing the increase in the model’s prediction error after permuting
the values of the feature. If randomly shuffling the values of a feature increases the model error, it means that the feature is
“important.” By contrast, if shuffling the values of a feature leaves the model error unchanged, the feature is not important
because the model has ignored the feature for the prediction. In other words, if we destroy the information contained in a
feature by randomly shuffling the feature values, the accuracy of our models should decrease. In addition, if the decrease is
substantial, it means that the information contained in the feature is important for our predictions.
Let us say we have trained a model and have measured its quality through MSE, log-loss, or another method. For each
feature in the dataset, we randomly shuffle the data in the feature while keeping the values of other features constant. We
then generate a new model based on the shuffled values, re-evaluate the quality of the newly trained model, and calculate
the feature importance based on the change in the quality of the new model relative to the original one. We perform this
process for all features, allowing us to rank all the features in terms of their predictive usefulness.
Let us use permutation feature importance with KNN for classification.
Input:
import pandas as pd
import numpy as np
Output:
0.000175
0.000150
0.000125
0.000100
0.000075
0.000050
0.000025
0.000000
0 5 10 15 20 25 30
We can also use permutation feature importance with KNN for regression using the same lines of code and replacing
KNeighborsClassifier with KNeighborsRegressor.
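Since the listing above shows only the imports, here is a minimal sketch of how this could be done with scikit-learn's permutation_importance; the file name 'creditcard.csv' and the target column 'Class' are assumptions based on the dataset used earlier in this section.

import pandas as pd
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.inspection import permutation_importance

df = pd.read_csv('creditcard.csv')
X = df.drop('Class', axis=1)
y = df['Class']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Fit the classifier, then shuffle each feature of the test set several times
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
result = permutation_importance(knn, X_test, y_test, n_repeats=5, random_state=0)

# Plot the mean importance of each feature
plt.bar(range(X.shape[1]), result.importances_mean)
plt.show()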
Permutation feature importance can be computationally expensive due to the necessary iteration through each predictor.
We also need to pay attention to the potential presence of multicollinearity and consider the context of our model, as scores
can be relative.
As we have seen in this entire chapter, there are many ways to select features. Only a few of the large number of methods
in the literature have been described here. In summary, filter methods do not incorporate a specific machine learning algo-
rithm and are much faster compared to wrapper methods and less prone to overfitting. Wrapper methods evaluate based on
a specific machine learning algorithm to determine the most important features; their drawbacks are the computation time
needed and the high chances of overfitting. In embedded methods, feature selection is performed by observing each iteration
of a model’s training phase; they are effective in reducing overfitting through penalization techniques and represent a good
trade-off regarding computation time.
Feature selection can also be performed by managing features as hyperparameters to fine-tune. In this case, we can create
a pipeline object to assemble the data transformation and apply an estimator. As we have seen, selection of the k-best vari-
ables according to a given correlation metric can be performed by the SelectKBest module of scikit-learn. We can combine
this module with a supervised model in a Pipeline object. GridSearchCV can then be used to perform tuning of the hyper-
parameter considering k as a hyperparameter of the pipeline.
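A minimal sketch of this idea, assuming a generic classification dataset already split into X_train and y_train, the ANOVA F-test as the correlation metric, and a logistic regression as the supervised model:

from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Assemble the feature selection step and the supervised model in a single pipeline
pipe = Pipeline([
    ('select', SelectKBest(score_func=f_classif)),
    ('model', LogisticRegression(max_iter=1000))
])

# Treat k, the number of selected features, as a hyperparameter of the pipeline
param_grid = {'select__k': [5, 10, 15, 20]}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_)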
X = pd.read_csv('data.csv')
X_cudf = cudf.read_csv('data.csv')
Another way to leverage GPUs is to use Keras and TensorFlow, in which it is very straightforward to incorporate a reg-
ularization such as L2 or L1:
tf.keras.layers.Dense(32, kernel_regularizer='l2')
tf.keras.layers.Dense(32, kernel_regularizer=l2(0.01),
bias_regularizer=l2(0.01))
The kernel_regularizer will apply a penalty on the kernel of the layer, and the bias_regularizer will apply a bias penalty to
the layer. It is also possible to use L1 and L2 at the same time by adding “kernel_regularizer=l1_l2(l1=0.01, l2=0.01).”
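As a small, self-contained sketch (the layer sizes and the random data below are illustrative only, not from the original text):

import numpy as np
import tensorflow as tf
from tensorflow.keras.regularizers import l2, l1_l2

# A small model combining kernel and bias penalties on its hidden layers
model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation='relu', input_shape=(10,),
                          kernel_regularizer=l2(0.01), bias_regularizer=l2(0.01)),
    tf.keras.layers.Dense(32, activation='relu',
                          kernel_regularizer=l1_l2(l1=0.01, l2=0.01)),
    tf.keras.layers.Dense(1)
])
model.compile(optimizer='adam', loss='mse')

# Train on random data just to show that the penalties are added to the loss
X = np.random.rand(100, 10)
y = np.random.rand(100, 1)
model.fit(X, y, epochs=2, verbose=0)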
We can also explore PyTorch in applying regularization. For example, it is possible to choose the α value and sum the
weights squared:
l2_alpha = 0.001
l2_norm = sum(p.pow(2.0).sum() for p in model.parameters())
This penalty term can then be added to the loss after it is calculated:
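A minimal sketch of what such a training step could look like (the model, data, and loss function below are placeholders, not taken from the original text):

import torch
import torch.nn as nn

model = nn.Linear(10, 1)                  # placeholder model
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(32, 10)                   # placeholder batch
y = torch.randn(32, 1)

l2_alpha = 0.001
output = model(x)
loss = criterion(output, y)
# Sum of squared weights, scaled by alpha and added to the loss
l2_norm = sum(p.pow(2.0).sum() for p in model.parameters())
loss = loss + l2_alpha * l2_norm

optimizer.zero_grad()
loss.backward()
optimizer.step()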
We will explore GPUs further in the description of machine and deep learning algorithms, as they are crucial in several
use cases to accelerate the computing time.
• feature_selection: Here, we can select a feature selection method (filter, wrapper, or embedded):
– Filter options:
◦ variance_threshold: Apply a variance threshold. If we choose this option, we also need to indicate the features we
want to process (features_to_process= [’feature_1’, ’feature_2’, …]) and the threshold (var_threshold=0 or any
number).
◦ chi2: Perform a chi-squared test on the samples and retrieve only the k-best features. We can define k with the
k_features parameter.
◦ anova_f_c: Create a SelectKBest object to select features with the k-best ANOVA F-values for classification. We can
define k with the k_features parameter.
◦ anova_f_r: Create a SelectKBest object to select features with the k-best ANOVA F-values for regression. We can
define k with the k_features parameter.
◦ pearson: The main idea for feature selection is to retain the variables that are highly correlated with the target and
keep features that are uncorrelated among themselves. The Pearson correlation coefficient between features is
defined by cc_features, and that between features and the target is defined by cc_target.
Here are two short example calls. The original examples are not reproduced here; the calls below are illustrative sketches only, modeled on the wrapper and embedded examples further down and on the parameters just described:
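# Chi-squared filter keeping the 10 best features (illustrative sketch)
ml_pipeline_function(df, output_folder = './Outputs/', missing_method =
'row_removal', test_size = 0.2, categorical = ['label_encoding'], features_label =
['Target'], rescaling = 'standard_scaler', feature_selection = 'chi2',
k_features = 10)

or

# Pearson correlation filter (illustrative sketch)
ml_pipeline_function(df, output_folder = './Outputs/', missing_method =
'row_removal', test_size = 0.2, categorical = ['label_encoding'], features_label =
['Target'], rescaling = 'standard_scaler', feature_selection = 'pearson',
cc_features = 0.7, cc_target = 0.1)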
– Wrapper methods: The following options are available for feature_selection: “forward_stepwise,”
“backward_elimination,” and “exhaustive.”
◦ wrapper_classifier: In wrapper methods, we need to select a classifier or regressor. We can choose one from scikit-
learn, such as KNeighborsClassifier(), RandomForestClassifier(), LinearRegression(), or others, and apply it to forward
stepwise (forward_stepwise), backward elimination (backward_elimination), or exhaustive (exhaustive) methods.
◦ min_features and max_features: These are attributes for exhaustive to specify the minimum and maximum num-
ber of features desired in the combination.
Here is a short example:
# Run ML Pipeline
ml_pipeline_function(df, output_folder = './Outputs/', missing_method =
'row_removal', test_size = 0.2, categorical = ['label_encoding'],features_label =
['Target'], rescaling = 'standard_scaler', feature_selection =
'backward_elimination', wrapper_classifier = KNeighborsClassifier())
– Embedded methods:
◦ feature_selection: We can select from several methods.
# Run ML Pipeline
ml_pipeline_function(df, output_folder = './Outputs/', missing_method =
'row_removal', test_size = 0.2, categorical = ['label_encoding'],features_label =
['Target'], rescaling = 'standard_scaler', feature_selection = 'feat_reg_ml',
ml_penalty = LinearSVC(C=0.05, penalty='l1', dual=False, max_iter = 5000))
Further Reading
Aeberhard, S., Coomans, D., and de Vel, O. (1992). Comparison of classifiers in high dimensional settings. Technical Report no.
92-01. Dept. of Computer Science and Dept. of Mathematics and Statistics, James Cook University of North Queensland. (Also
submitted to Technometrics).
Aeberhard, S., Coomans, D., and de Vel, O. (1992). The classification performance of RDA. Technical Report no. 92-01. Dept. of
Computer Science and Dept. of Mathematics and Statistics, James Cook University of North Queensland. (Also submitted to
Journal of Chemometrics).
Alkharusi, H. (2012). Categorical variables in regression analysis: a comparison of dummy and effect coding. International Journal
of Education 4: 202–210. https://doi.org/10.5296/ije.v4i2.1962.
Azur, M.J., Stuart, E.A., Frangakis, C., and Leaf, P.J. (2011). Multiple imputation by chained equations: what is it and how does it
work? International Journal of Methods in Psychiatric Research 20 (1): 40–49. https://doi.org/10.1002/mpr.329.
Belkin, M. and Niyogi, P. (2001). Laplacian eigenmaps and spectral techniques for embedding and clustering. Advances in Neural
Information Processing Systems 14: 585–591.
Bengio, Y. and Monperrus, M. (2004). Non-local manifold tangent learning. In: Proceedings of the 17th International Conference on
Neural Information Processing Systems (NIPS’04), vol. 17, pp. 129–136. MIT Press.
Bingham, E. and Mannila, H. (2001). Random projection in dimensionality reduction: applications to image and text data. In:
Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ‘01), New
York, NY, USA, pp. 245–250. ACM.
Birjandtalab, J., Pouyan, M.B., and Nourani, M. (2016). Nonlinear dimension reduction for EEG-based epileptic seizure detection.
In: 2016 IEEE-EMBS International Conference on Biomedical and Health Informatics (BHI), Las Vegas, NV, USA, pp. 595–598.
https://doi.org/10.1109/BHI.2016.7455968.
Box, G.E.P. and Cox, D.R. (1964). An analysis of transformations. Journal of the Royal Statistical Society B 26: 211–252.
Brand, M. (2002). Charting a manifold. In: Proceedings of the 15th International Conference on Neural Information Processing
Systems (NIPS’02), vol. 15, pp. 961–968. MIT Press.
Breiman, L. (2001). Random forests. Machine Learning 45 (1): 5–32. https://doi.org/10.1023/A:1010933404324.
Breiman, L., Friedman, J., Olshen, R., and Stone, C. (1984). Classification and Regression Trees. Wadsworth.
van Buuren, S. (2007). Multiple imputation of discrete and continuous data by fully conditional specification. Statistical Methods in
Medical Research 16: 219–242. https://doi.org/10.1177/0962280206074463.
van Buuren, S. and Groothuis-Oudshoorn, K. (2011). mice: Multivariate imputation by chained equations in R. Journal of Statistical
Software 45: 1–67.
Carey, G. (2003). Coding categorical variables. http://ibgwww.colorado.edu/ carey/p5741ndir/Coding_Categorical_Variables.pdf.
Cestnik, B. and Bratko, I. (1991). On estimating probabilities in tree pruning. In: Machine Learning — EWSL-91. EWSL 1991,
Lecture Notes in Computer Science (Lecture Notes in Artificial Intelligence), vol. 482 (ed. Y. Kodratoff), pp. 138–150. Springer.
https://doi.org/10.1007/BFb0017010.
Cohen, D. (1972). Magnetoencephalography: detection of the brain’s electrical activity with a superconducting magnetometer.
Science 175: 664–666.
Cowell, R.G., Dawid, A.P., Lauritzen, S.L., and Spiegelhalter, D.J. (1999). Probabilistic Networks and Expert Systems. Springer.
Daniel, W.W. and Cross, C.L. (2018). Biostatistics: A Foundation for Analysis in the Health Sciences. Wiley.
Dasarathy, B.V. (1980). Nosing around the neighborhood: a new system structure and classification rule for recognition in partially
exposed environments. IEEE Transactions on Pattern Analysis and Machine Intelligence PAMI-2, 1: 67–71.
Dasgupta, S. (2000). Experiments with random projection. In: Proceedings of the Sixteenth conference on Uncertainty in Artificial
Intelligence (UAI’00) (ed. C. Boutilier and M. Goldszmidt), 143–151. Morgan Kaufmann Publishers Inc.
Deng, X., Li, Y., Weng, J., and Zhang, J. (2019). Feature selection for text classification: a review. Multimedia Tools and Applications
78, 3 (Feb 2019): 3797–3816. https://doi.org/10.1007/s11042-018-6083-5.
Donoho, D. and Grimes, C. (2003). Hessian eigenmaps: locally linear embedding techniques for high dimensional data. Proceedings
of National Academy of Sciences of the United States of America 100 (10): 5591–5596.
Duda, R.O. and Hart, P.E. (1973). Pattern Classification and Scene Analysis (Q327.D83). Wiley.
Durrett, R. (1996). Probability: Theory and Examples, 2e, 62. Duxbury Press.
Dy, J.G. and Brodley, C.E. (2004). Feature selection for unsupervised learning. Journal of Machine Learning Research 5: 845–889.
Feller, W. (1968). An Introduction to Probability Theory and Its Applications, 3e, vol. 1.
Ferri, F.J., Pudil, P., Hatef, M., and Kittler, J. (1994). Comparative study of techniques for large-scale feature selection. In: Machine
Intelligence and Pattern Recognition, vol. 16 (ed. E.S. Gelsema and L.S. Kanal), pp. 403–413. North-Holland.
Further Reading 171
Fisher, R.A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics 7 (Part II): 179–188; also in
Contributions to Mathematical Statistics (Wiley, NY, 1950).
Florescu, I. (2014). Probability and Stochastic Processes. Wiley.
Friedman, J. (2001). Greedy function approximation: a gradient boosting machine. The Annals of Statistics 29 (5): 1189–1232.
https://doi.org/10.1214/aos/1013203451.
Friedman, J. (2002). Stochastic gradient boosting. Computational Statistics and Data Analysis 38 (4): 367–378. https://doi.org/
10.1016/S0167-9473(01)00065-2.
Gallager, R.G. (2013). Stochastic Processes Theory for Applications. Cambridge University Press.
Gashler, M., Ventura, D., and Martinez, T. (2008). Iterative non-linear dimensionality reduction with manifold sculpting. In:
Advances in Neural Information Processing Systems, vol. 20 (ed. J.C. Platt, D. Koller, Y. Singer, and S. Roweis), pp. 513–520.
MIT Press.
Gates, G.W. (1972). The reduced nearest neighbor rule. IEEE Transactions on Information Theory 1972: 431–433.
Gelman, A. and Hill, J. (2006). Data Analysis Using Regression and Multilevel/Hierarchical Models (Analytical Methods for Social
Research). Cambridge University Press. https://doi.org/10.1017/CBO9780511790942.
George, G. (2004). Testing for the independence of three events. Mathematical Gazette 88: 568.
Grus, J. (2015). Data Science from Scratch. O’Reilly.
Guyon, I. and Elisseeff, A. (2003). An introduction to variable and feature selection. Journal of Machine Learning Research 3:
1157–1182.
Hamel, P. and Eck, D. (2010). Learning features from music audio with deep belief networks. In: Proceedings of the 11th
International Society for Music Information Retrieval Conference, ISMIR, Utrecht, Netherlands, pp. 339–344.
Hastie, T., Tibshirani, R., and Friedman, J. (2009). Elements of Statistical Learning, 2e. Springer.
Hong, S. and Lynn, H.S. (2020). Accuracy of random-forest-based imputation of missing data in the presence of non-normality,
non-linearity, and interaction. BMC Medical Research Methodology 20: 199. https://doi.org/10.1186/s12874-020-01080-1.
Hwei, P. (1997). Theory and Problems of Probability, Random Variables, and Random Processes. McGraw-Hill. ISBN: 0-07-030644-3.
Hyvärinen, A. and Oja, E. (2000). Independent component analysis: algorithms and applications. Neural Network 13 (4–5): 411–
430. https://doi.org/10.1016/s0893-6080(00)00026-5.
Hyvärinen, A., Karhunen, J., and Oja, E. (2001). Independent Component Analysis. Wiley.
Ioffe, S.; Szegedy, C. (2015). Batch normalization: accelerating deep network training by reducing internal covariate shift. In
Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37 (ICML’15).
JMLR.org, 448–456.
Ipsen, N., Mattei, P., and Frellsen, J. (2022). How to deal with missing data in supervised deep learning? In: Artemiss - ICML
Workshop on the Art of Learning with Missing Values, 1–30. Vienne, Austria. hal-03044144. https://openreview.net/pdf?
id=J7b4BCtDm4.
Jamieson, A.R., Giger, M.L., Drukker, K. et al. (2010). Exploring nonlinear feature space dimension reduction and data
representation in breast CADx with Laplacian Eigenmaps and t-SNE. Medical Physics 37 (1): 339–351. https://doi.org/10.1118/
1.3267037.
Junn, J. and Masuoka, N. (2020) Replication data for: the gender gap is a race gap: women voters in U.S. Presidential Elections.
Harvard Dataverse, V1. https://doi.org/10.7910/DVN/XQYJKN.
Juszczak, P., Tax, D.M.J., and Duin, R.P.W. (2002). Feature scaling in support vector data description. In: Proceedings of the ASCI
2002 8th Annual Conference of the Advanced School for Computing and Imaging. Citeseer, pp. 95–102. https://www.researchgate.
net/publication/2535451_Feature_Scaling_in_Support_Vector_Data_Description.
Kohavi, R. and John, G.H. (1997). Wrappers for feature subset selection. Artificial Intelligence 97 (1–2): 273–324.
Lapidoth, A. (2017). A Foundation in Digital Communication. Cambridge University Press.
Levina, E. and Bickel, P.J. (2004). Maximum likelihood estimation of intrinsic dimension. In: Proceedings of the 17th International
Conference on Neural Information Processing Systems (NIPS’04). pp. 777–784. MIT Press.
Li, T., Zhu, S., and Ogihara, M. (2006). Using discriminant analysis for multi-class classification: an experimental investigation.
Knowledge and Information Systems 10 (4): 453–472.
Little, R.J.A. and Rubin, D.B. (1986). Statistical Analysis with Missing Data. Wiley.
Liu, H. and Motoda, H. (1998). Feature Selection for Knowledge Discovery & Data Mining. Kluwer Academic Publishers.
Liu, H. and Motoda, H. (ed.) (2007). Computational Methods of Feature Selection. Chapman and Hall/CRC Press.
Liu, H. and Yu, L. (2005). Toward integrating feature selection algorithms for classification and clustering. IEEE Transactions on
Knowledge and Data Engineering 17 (3): 1–12.
172 2 Feature Engineering Techniques in Machine Learning
Ma, S. and Huang, J. (2008). Penalized feature selection and classification in bioinformatics. Briefings in Bioinformatics 9 (5):
392–403.
van der Maaten, L.J.P. (2009). Learning a parametric embedding by preserving local structure. In: Proceedings of the Twelfth
International Conference on Artificial Intelligence and Statistics, Clearwater Beach, Florida, USA, PMLR 5: 384–391.
van der Maaten, L.J.P. (2014). Accelerating t-SNE using tree-based algorithms. Journal of Machine Learning Research 15 (October):
3221–3245.
van der Maaten, L.J.P. and Hinton, G.E. (2008). Visualizing high-dimensional data using t-SNE. Journal of Machine Learning
Research 9 (November): 2579–2605.
van der Maaten, L.J.P. and Hinton, G.E. (2012). Visualizing non-metric similarities in multiple maps. Machine Learning 87 (1):
33–55.
Martinez, M. and Kak, A.C. (2001). PCA versus LDA. IEEE Transactions on Pattern Analysis and Machine Intelligence 23 (2): 228–
233. https://doi.org/10.1109/34.908974.
Micci-Barreca, D. (2001). A preprocessing scheme for high-cardinality categorical attributes in classification and prediction
problems. SIGKDD Explorations Newsletter 3: 1. http://dx.doi.org/10.1145/507533.507538.
Papoulis, A. (1991). Probability, Random Variables and Stochastic Processes. McGraw Hill.
Park, K. (2018). Fundamentals of Probability and Stochastic Processes with Applications to Communications. Springer.
Parson, L., Haque, E., and Liu, H. (2004). Subspace clustering for high dimensional data – a review. ACM SIGKDD Explorations
Newsletter Archive special issue on learning from imbalanced datasets 6 (1): 90–105. 1931–0145.
Pedregosa, F., Grisel, O., Blondel, M., et al. (2011). Manifold learning on handwritten digits: locally linear embedding, Isomap.
License: BSD 3 clause (C) INRIA 2011, Online scikit-learn documentation, Scikit-learn: Machine Learning in Python, Pedregosa
et al., JMLR 12, pp. 2825–2830. https://scikit-learn.org/stable/auto_examples/manifold/plot_lle_digits.html.
VanderPlas, J. (2016). Python Data Science Handbook. O'Reilly Media, Inc. ISBN 9781491912058.
Pudil, P., Novovičová, J., and Kittler, J. (1994). Floating search methods in feature selection. Pattern Recognition Letters 15 (11):
1119–1125.
Radhakrishna Rao, C. (1948). The utilization of multiple measurements in problems of biological classification. Journal of the Royal
Statistical Society, Series B (Methodological) 10 (2): 159–203.
Raghunathan, T.W., Lepkowksi, J.M., Van Hoewyk, J., and Solenbeger, P. (2001). A multivariate technique for multiply imputing
missing values using a sequence of regression models. Survey Methodology 27: 85–95.
ResearchGate. Iterative non-linear dimensionality reduction with manifold sculpting. https://www.researchgate.net/publication/
220270207_Iterative_Non-linear_Dimensionality_Reduction_with_Manifold_Sculpting.
Robnik-Sikonja, M. and Kononenko, I. (2003). Theoretical and empirical analysis of ReliefF and RReliefF. Machine Learning 53: 23–69.
Roweis, S.T. (1997). Em algorithms for PCA and SPCA. In: Advances in Neural Information Processing Systems, vol. 10 (ed. M.I.
Jordan, M.J. Kearns, and S.A. Solla), pp. 626–632. https://api.semanticscholar.org/CorpusID:1939401.
Roweis, S.T. and Saul, L.K. (2000). Nonlinear dimensionality reduction by locally linear embedding. Science 290.
Russell, S. and Norvig, P. (2002). Artificial Intelligence: A Modern Approach. Prentice Hall.
Saporta, G. (2006). Probabilités, analyse des données et statistique, Technip Éditions, p. 622 (ISBN 2-7108-0565-0).
Saul, L.K. and Roweis, S.T. (2003). Think globally, fit locally: unsupervised learning of low dimensional manifolds. Journal of
Machine Learning Research 4: 119–155.
Schafer, J.L. (1999). Multiple imputation: a primer. Statistical Methods in Medical Research 8 (1): 3–15.
Schölkopf, B., Smola, A.J., and Müller, K.-R. (1999). Kernel principal component analysis. In: Advances in Kernel Methods: Support
Vector Learning, 327–352. MIT Press.
Shah, A.D., Bartlett, J.W., Carpenter, J. et al. (2014). Comparison of random forest and parametric imputation models for imputing
missing data using MICE: a CALIBER study. American Journal of Epidemiology 179 (6): 764–774. https://doi.org/10.1093/aje/
kwt312.
Shanker, M., Hu, M.Y., and Hung, M.S. (1996). Effect of data standardization on neural network training. Omega 24: 385–397.
https://www.sciencedirect.com/science/article/pii/0305048396000102.
de Silva, V. and Tenenbaum, J. B. (2002). Global versus local methods in nonlinear dimensionality reduction. In: Proceedings of the
15th International Conference on Neural Information Processing Systems (NIPS’02). pp. 721–728. MIT Press.
Singhi, S., and Liu, H. (2006). Feature subset selection bias for classification learning. In: Proceedings of the 23rd international
conference on Machine learning (ICML ’06). Association for Computing Machinery, New York, NY, USA, pp. 849–856. https://
doi.org/10.1145/1143844.1143951.
Further Reading 173
Su, Y.S., Gelman, A., Hill, J., and Yajima, M. (2009). Multiple imputation with diagnostics (mi) in R: opening windows into the
black box. Journal of Statistical Software 45: 1–31.
Sumithra, V. and Surendran, S. (2015). A review of various linear and non linear dimensionality reduction techniques.
International Journal of Computer Science and Information Technologies 6 (3): 2354–2360.
Tenenbaum, J.B., de Silva, V., and Langford, J.C. (2000). A global geometric framework for nonlinear dimensionality reduction.
Science 290: 2319–2323.
Vincent, P. and Bengio, Y. (2002). Manifold parzen windows. In: Proceedings of the 15th International Conference on Neural
Information Processing Systems (NIPS’02), vol. 15, pp. 825–832. MIT Press.
Wallach, I. and Liliean, R. (2009). The protein-small-molecule database, a non-redundant structural resource for the analysis of
protein-ligand binding. Bioinformatics 25 (5): 615–620. PMID 19153135. https://doi.org/10.1093/bioinformatics/btp035.
Wang, J. (2012). Geometric Structure of High-Dimensional Data and Dimensionality Reduction. Springer. https://doi.org/10.1007/
978-3-642-27497-8.
Weinberger, K., Dasgupta, A., Langford, J. et al. (2009). Feature hashing for large scale multitask learning. In: Proceedings of the
26th Annual International Conference on Machine Learning (ICML June, 14th, 2009). Association for Computing Machinery,
New York, NY, USA, 1113–1120. https://doi.org/10.1145/1553374.1553516.
Weisberg S. (2001). Yeo-Johnson power transformations. www.stat.umn.edu/arc/ (accessed 26 October 2001).
Wu, Y.N. (2014). Statistical independence. In: Computer Vision (ed. K. Ikeuchi). Springer. https://doi.org/10.1007/978-0-387-
31439-6_744.
Yeo, I.K. and Johnson, R.A. (2000). A new family of power transformations to improve normality or symmetry. Biometrika 87 (4):
954–959.
Yu, L. and Liu, H. (2004). Efficient feature selection via analysis of relevance and redundancy. Journal of Machine Learning
Research 5 (October): 1205–1224.
Yu, K., Guo, X., and Liu, L., et al. (2020). Causality-based feature selection: methods and evaluations. ACM Computing Surveys 53, 5,
Article 111 (September 2021). https://doi.org/10.1145/3409382.
Zhang, Z. and Zha, H. (2006). A domain decomposition method for fast manifold learning. In: Advances in Neural Information
Processing Systems, vol. 18 (ed. Y. Weiss, B. Schölkopf, and J. Platt). MIT Press.
Zhao, Z. and Liu, H. (2007a). Searching for interacting features. Conference: IJCAI 2007, Proceedings of the 20th International Joint
Conference on Artificial Intelligence, Hyderabad, India (6–12 January 2007).
Zhao, Z. and Liu, H. (2007b). Semi-supervised feature selection via spectral analysis. SDM.
Feature extraction (audio, video, text) https://www.mathworks.com/discovery/feature-extraction.html
https://www.kaggle.com/c/caterpillar-tube-pricing/discussion/15748#143154
http://contrib.scikit-learn.org/category_encoders/jamesstein.html
http://genet.univ-tours.fr/gen002200/bibliographie/Bouquins%20INRA/Biblio/Independent%20component%20analysis%20A%
20tutorial.pdf
http://psych.colorado.edu/ carey/Courses/PSYC5741/handouts/Coding%20Categorical%20Variables%202006-03-03.pdf
http://surfer.nmr.mgh.harvard.edu/fswiki
http://usir.salford.ac.uk/id/eprint/52074/1/AI_Com_LDA_Tarek.pdf
https://analyticsindiamag.com/5-ways-handle-missing-values-machine-learning-datasets/
https://bib.irb.hr/datoteka/763354.MIPRO_2015_JovicBrkicBogunovic.pdf
https://contrib.scikit-learn.org/category_encoders/index.html
https://cran.r-project.org/web/packages/miceRanger/vignettes/miceAlgorithm.html
https://cs.nyu.edu/ roweis/lle/papers/lleintro.pdf
https://datascienceplus.com/understanding-the-covariance-matrix/
https://docs.rapids.ai/api
https://en.wikipedia.org/wiki/Decision_tree_learning
https://inside-machinelearning.com/regularization-deep-learning/
https://machinelearningmastery.com/basic-feature-engineering-time-series-data-python/
https://machinelearningmastery.com/power-transforms-with-scikit-learn/
https://medium.com/analytics-vidhya/linear-discriminant-analysis-explained-in-under-4-minutes-e558e962c877
https://medium.com/rapids-ai/accelerating-random-forests-up-to-45x-using-cuml-dfb782a31bea
https://miro.medium.com/max/2100/0∗NBVi7M3sGyiUSyd5.png
https://nycdatascience.com/blog/meetup/featured-talk-1-kaggle-data-scientist-owen-zhang/
174 2 Feature Engineering Techniques in Machine Learning
https://pandas.pydata.org/docs/reference/api/pandas.DatetimeIndex.html
https://scikit-learn.org/dev/modules/lda_qda.html
https://scikit-learn.org/stable/
https://scikit-learn.org/stable/auto_examples/decomposition/plot_pca_vs_lda.html#sphx-glr-auto-examples-decomposition-plot-
pca-vs-lda-py
https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html
https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.GenericUnivariateSelect.html#sklearn.
feature_selection.GenericUnivariateSelect
https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html
https://scikit-learn.org/stable/modules/impute.html#impute
https://sebastianraschka.com/Articles/2014_python_lda.html
https://stats.oarc.ucla.edu/r/library/r-library-contrast-coding-systems-for-categorical-variables/
https://towardsdatascience.com/7-ways-to-handle-missing-values-in-machine-learning-1a6326adf79e
https://towardsdatascience.com/all-about-categorical-variable-encoding-305f3361fd02
https://towardsdatascience.com/all-about-feature-scaling-bcc0ad75cb35
https://towardsdatascience.com/box-cox-transformation-explained-51d745e34203
https://towardsdatascience.com/feature-extraction-techniques-d619b56e31be
https://towardsdatascience.com/preprocessing-with-sklearn-a-complete-and-comprehensive-guide-670cb98fcfb9
https://towardsdatascience.com/top-4-time-series-feature-engineering-lessons-from-kaggle-ca2d4c9cbbe7
https://towardsdatascience.com/types-of-transformations-for-better-normal-distribution-61c22668d3b9
https://towardsdatascience.com/understand-data-normalization-in-machine-learning-8ff3062101f0
https://www.analyticsvidhya.com/blog/2019/12/6-powerful-feature-engineering-techniques-time-series/
https://www.analyticsvidhya.com/blog/2020/10/feature-selection-techniques-in-machine-learning/
https://www.datacamp.com/community/tutorials/categorical-data
https://www.kaggle.com/code/louise2001/rapids-feature-importance-is-all-you-need/notebook
https://www.kaggle.com/davidbnn92/weight-of-evidence-encoding
https://www.kaggle.com/pmarcelino/data-analysis-and-feature-extraction-with-python
https://www.kaggle.com/prashant111/comprehensive-guide-on-feature-selection
https://www.kaggle.com/subinium/11-categorical-encoders-and-benchmark
https://www.kaggle.com/sumanthvrao/daily-climate-time-series-data
https://www.mygreatlearning.com/blog/label-encoding-in-python/#labelencoding
https://www.statsmodels.org/dev/contrasts.html
3 Machine Learning Algorithms
In the literature, we can find many machine learning algorithms that can be used for different tasks, including simple linear
regression for prediction problems, decision trees, naïve Bayes classifiers, random forests, neural networks, and support
vector machines (SVMs). In this chapter, we will study some of the most commonly used machine learning algorithms along
with their fundamental math, use cases, and coding using Python and the various libraries presented in the first chapter of
this book. We have already encountered the concepts of supervised, unsupervised, and reinforcement learning. If we probe a
bit more, we can categorize the algorithms based on their underlying mathematical model: regression, clustering, Bayesian,
neural network, ensemble, regularization, rule system, dimensionality reduction, or decision tree. We have already seen
some algorithms and their respective categories.
As its name indicates, Bayesian machine learning is based on Bayes' theorem: such models compute conditional probabilities, for instance, the probability that Cristiano Ronaldo will score three goals given that he scored two in his last match. Regression involves finding a relationship between variables in our data; geometrically, we look for the best-fitting line through the data, the one that minimizes the error, and we can then use this line to output our prediction values. The objective of clustering is to divide our data points into several groups, each of which gathers points that are more similar to one another than to points in other groups.
Artificial neural networks are inspired by the human brain and mimic the way biological neurons communicate. Neural
networks are composed of node layers (artificial neurons) with an input layer, hidden layers, and an output layer. Each node
has a weight and a threshold and is connected to another. If the output of an individual node is above the threshold value
(specified by the user), the node is activated and sends data to the next layer of the network. Previously, we have seen reg-
ularization used in conjunction with classification or regression algorithms that penalize features that do not contribute to
the model given a coefficient threshold. Ensemble methods use multiple learning algorithms to obtain better performance.
Rule-based machine learning algorithms run with predefined sets of rules, created by the user.
Let us explore some popular algorithms.
3.1 Linear Regression
[Figure: an example regression line with a y-intercept equal to 2.]
The question is how we can find the regression coefficient to draw the best-fitting line for linear regression. We can use batch
gradient descent, stochastic gradient descent, or normal equation.
First, we need to define a cost function. We call the error between predicted values and observed values “residuals.” Our
cost function is the sum of squares of residuals.
[Figure: a regression line with an observed value yi, its predicted value on the line, and the residual between them.]
We denote the cost function J, defined as half the mean of the squared residuals (predicted values minus observed values); the learning objective is to minimize it:
$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \bigl(h_\theta(x_i) - y_i\bigr)^2$$
with the hypothesis
$$h_\theta(x) = \hat{y} = \theta_0 + \theta_1 x$$
For each valid data entry in our data, we want to minimize the following function:
$$\bigl(h_\theta(x_i) - y_i\bigr)^2$$
This function corresponds to the difference between our hypothetical model’s predictions and the real values. For every
value, we can generalize as follows:
$$\sum_{i=1}^{m} \bigl(h_\theta(x_i) - y_i\bigr)^2$$
where m is the number of records in our dataset. The objective of the learning algorithm is to find the ideal parameters θ0
and θ1 so that hθ(x) is close to y for the training examples (x, y). As expressed above, this is represented by the following
mathematical expression that we need to minimize:
$$\frac{1}{2m} \sum_{i=1}^{m} \bigl(h_\theta(x_i) - y_i\bigr)^2$$
where hθ(xi) = θ0 + θ1xi, (xi, yi) represents the ith training data, m is the number of training examples, and 1/2 is a constant introduced to simplify the calculations for gradient descent. This produces the cost function, defined as 1/2 of the mean of the squares of hθ(xi) − yi, with the learning objective to minimize it by using, for example, gradient descent:
$$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \bigl(h_\theta(x_i) - y_i\bigr)^2$$
Gradient descent will choose random values of θ0 and θ1 and iteratively update them until convergence, the point at which we reach a local minimum, meaning that the cost function does not decrease further.
Let us review our cost function:
$$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \bigl(h_\theta(x_i) - y_i\bigr)^2$$
The parameters are updated with the gradient descent rule:
$$\theta_j = \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$$
where α is the learning rate. The algorithm iteratively calculates the next point using the gradient at the current position, scales it by the learning rate, and subtracts the obtained value from the current position. The smaller the learning rate, the more iterations gradient descent needs to converge to the optimum point. On the other hand, if the learning rate is too high, the algorithm risks failing to converge to the optimal point.
$$\frac{\partial}{\partial \theta_j} J(\theta) = \frac{\partial}{\partial \theta_j} \frac{1}{2m} \sum_{i=1}^{m} \bigl(h_\theta(x_i) - y_i\bigr)^2$$
$$\frac{\partial}{\partial \theta_j} J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \bigl(h_\theta(x_i) - y_i\bigr) \frac{\partial}{\partial \theta_j} \bigl(h_\theta(x_i) - y_i\bigr)$$
$$\frac{\partial}{\partial \theta_j} J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \bigl(h_\theta(x_i) - y_i\bigr)\, x_i$$
$$\theta_j = \theta_j - \frac{\alpha}{m} \sum_{i=1}^{m} \bigl(h_\theta(x_i) - y_i\bigr)\, x_i$$
where θj is the weight of the hypothesis, hθ(xi) is the predicted y value for the ith input, j is the feature index number, and α is
the learning rate.
To illustrate the gradient descent process, let us take a simple (univariate) example with a quadratic function:
$$f(x) = x^2 + 2x + 1$$
$$\frac{df(x)}{dx} = 2x + 2$$
Let us take a learning rate of 0.1 and a starting point at x = 10. If we calculate the first two steps, we obtain the following:
$$x_1 = 10 - 0.1\,(2 \times 10 + 2) = 7.8, \qquad f(x_0) = 10^2 + 2 \times 10 + 1 = 121$$
$$x_2 = 7.8 - 0.1\,(2 \times 7.8 + 2) = 6.04, \qquad f(x_1) = 7.8^2 + 2 \times 7.8 + 1 = 77.44$$
Let us now implement a gradient descent in Python to find the local minimum of our function.
Input:
plt.plot(x, y,'r-')
x0 = 10 # start point
y0 = fonction(x0)
plt.plot(x0, fonction(x0))
cond = eps + 10.0 # start with cond greater than eps (assumption)
nb_iter = 0
tmp_y = y0
Output:
[Plot: the quadratic function f(x) = x^2 + 2x + 1, with values ranging up to about 100 over the plotted interval.]
To implement gradient descent, we need a starting point, which we have defined manually as 10 but is often a random
initialization in the real world. We also need the gradient function, a learning rate, a maximum number of iterations,
and a tolerance to stop the algorithm conditionally.
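A complete, minimal version of such a gradient descent on f(x) = x^2 + 2x + 1 might look like the following sketch (the function name fonction and the stopping condition mirror the partial listing above):

import numpy as np
from matplotlib import pyplot as plt

def fonction(x):
    return x**2 + 2*x + 1

def gradient(x):
    return 2*x + 2

learning_rate = 0.1
eps = 1e-6          # tolerance used to stop the algorithm
max_iter = 1000     # maximum number of iterations

x0 = 10             # start point
y0 = fonction(x0)
cond = eps + 10.0   # start with cond greater than eps
nb_iter = 0
tmp_y = y0

while cond > eps and nb_iter < max_iter:
    x0 = x0 - learning_rate * gradient(x0)   # gradient descent update
    y0 = fonction(x0)
    nb_iter += 1
    cond = abs(tmp_y - y0)                   # stop when the function no longer decreases
    tmp_y = y0

print("Local minimum reached at x =", x0, "after", nb_iter, "iterations")

# Plot the function and the point found by the descent
x = np.linspace(-10, 10, 200)
plt.plot(x, fonction(x), 'r-')
plt.plot(x0, fonction(x0), 'bo')
plt.show()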
Let us take another example with the cost function:
$$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \bigl(h_\theta(x_i) - y_i\bigr)^2$$
We will consider the case of setting θ0 = 0, meaning that we have the following:
$$h_\theta(x) = \theta_1 x$$
Different lines will pass through the origin for any value of θ1 because the y-intercept is equal to zero.
[Figure: lines through the origin h_θ(x) = θ_1 x, shown for θ_1 = 2 and θ_1 = 1 on axes ranging from 1 to 5.]
We can calculate the cost function manually for different values of θ1. Let us set the number of records in our dataset to
m = 3.
$$J(\theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \bigl(\theta_1 x_i - y_i\bigr)^2$$
$$J(2) = \frac{1}{2 \times 3}\bigl(1^2 + 2^2 + 3^2\bigr) = 2.33$$
$$J(1) = \frac{1}{2 \times 3}\bigl(0^2 + 0^2 + 0^2\bigr) = 0$$
$$J(0.5) = 0.58$$
[Plot: the convex cost function J(θ1) as a function of θ1, minimized at θ1 = 1.]
Gradient descent does not work for all functions, as they must be differentiable (having derivatives for each point in their
domains) and convex (a line segment connecting two points of the function should lie on or above its curve rather than
cross it).
In the case of multiple linear regression, after performing feature rescaling (using mean normalization, for example), our
cost function will take the following form:
$$J(\theta_0, \theta_1, \ldots, \theta_n) = \frac{1}{2m} \sum_{i=1}^{m} \bigl(h_\theta(x_i) - y_i\bigr)^2$$
$$\theta_j = \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1, \ldots, \theta_n)$$
where j = 0, 1, …, n.
We can use gradient descent or normal equation to find the optimal parameters. Batch gradient descent and stochastic
gradient descent are further methodologies we can use; batch gradient descent involves calculations over the full training set
at each step, and stochastic gradient descent takes a random instance of training data at each step and then computes the
gradient. The consequence is that batch gradient descent is slower for large training datasets, efficient for convex or rela-
tively smooth error manifolds, and scalable with the number of features. Stochastic gradient descent is faster but does not
provide an optimal result, as it approaches the minimum but does not settle.
# Import Libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# Generate random linear data: y = 4 + 2x + noise (the data generation is assumed
# here, mirroring the TensorFlow example below)
np.random.seed(0)
x = np.random.rand(200, 1)
y = 4 + 2 * x + np.random.rand(200, 1)
# Scikit-learn implementation
# Model initialization
linear_regression_model = LinearRegression()
# Fit the data (train the model)
linear_regression_model.fit(x, y)
# Model prediction
y_predicted = linear_regression_model.predict(x)
# Data points and regression line
plt.scatter(x, y, s=10)
plt.plot(x, y_predicted, color='r')
plt.show()
Output:
Slope: [[1.94369415]]
Intercept: [4.5193723]
Root mean squared error: 0.08610037959679763
R2 score: 0.7795855368773891
[Scatter plot of the data with the fitted regression line in red; x ranges from 0 to 1 and y from about 4 to 7.]
Using the code, we have computed the slope and the intercept and have assessed the model with the R-squared score (coefficient of determination) and the root-mean-squared error (RMSE, the square root of the mean of the squared residuals):
$$\mathrm{RMSE} = \sqrt{\frac{1}{m} \sum_{i=1}^{m} \bigl(h(x_i) - y_i\bigr)^2}$$
$$R^2 = 1 - \frac{\sum_{i=1}^{m} \bigl(h(x_i) - y_i\bigr)^2}{\sum_{i=1}^{m} \bigl(y_i - \bar{y}\bigr)^2}$$
As we can see in the results, we have reduced the prediction error by 77.95% by using regression.
Now, let us implement a simple linear regression with TensorFlow by generating a random dataset. We will use gradient
descent, a learning rate of 0.01, and 1000 epochs.
Input:
# Import Libraries
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()
import numpy as np
import matplotlib.pyplot as plt
rng = np.random
# Parameters
learning_rate = 0.01
training_epochs = 1000
display_step = 50
# Training Data
# Generating random linear data
np.random.seed(0)
# There will be 200 data points with x ranging from 0 to 1
train_X = np.random.rand(200, 1)
train_Y = 4 + 2 * train_X + np.random.rand(200, 1)
n_samples = train_X.shape[0]
# tf Graph Input
X = tf.placeholder("float")
Y = tf.placeholder("float")
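# Model parameters and the linear hypothesis pred = W*X + b with a mean squared
# error cost (these definitions are assumed here; the original listing omits them)
W = tf.Variable(rng.randn(), name="weight")
b = tf.Variable(rng.randn(), name="bias")
pred = tf.add(tf.multiply(X, W), b)
cost = tf.reduce_sum(tf.pow(pred - Y, 2)) / (2 * n_samples)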
# Gradient descent
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)
# Initialize the variables and start training
init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)
    # Training loop (assumed; the original listing omits it)
    for epoch in range(training_epochs):
        for (x_i, y_i) in zip(train_X, train_Y):
            sess.run(optimizer, feed_dict={X: x_i, Y: y_i})
    print("Optimization Finished!")
    training_cost = sess.run(cost, feed_dict={X: train_X, Y: train_Y})
    print("Training cost=", training_cost, "W=", sess.run(W), "b=", sess.run(b), '\n')
    # Graphic display
    plt.plot(train_X, train_Y, 'ro', label='Original data')
    plt.plot(train_X, sess.run(W) * train_X + sess.run(b), label='Fitted line')
    plt.legend()
    plt.show()
Output:
[Plot: the original data points (red) and the fitted line; x from 0 to 1, y from about 4 to 7.]
As we can see, the result shows the weight W = 2.4589992 and the bias b = 4.244027.
If we want to print the number of GPUs available in our system and how the operations and tensors are assigned (to GPUs,
CPUs, etc.), we can add lines at the top of the code.
Input:
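For example, with TensorFlow 2 the following lines would list the visible GPUs and log the device on which each operation is placed (a sketch; not necessarily the exact lines used in the original):

import tensorflow as tf

# Print how many GPUs TensorFlow can see
print("Num GPUs Available:", len(tf.config.list_physical_devices('GPU')))

# Log on which device (CPU or GPU) each operation and tensor is placed
tf.debugging.set_log_device_placement(True)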
in the world (https://www.kaggle.com/datasets/sohier/calcofi?select=bottle.csv). This dataset has become valuable for doc-
umenting climatic cycles locally. For this exercise, we will extract the following variables:
With these data, we can already ask some questions such as whether there is a relationship between water salinity and
water temperature, or if we can predict the water temperature based on salinity and depth in meters. Let us use the water
temperature (T_DegC) as our dependent variable (target).
To start, we will capture the dataset using pandas DataFrame, drop rows having at least one missing value, and split the
data into the variable we want to predict (T_degC) and the selected features (Depthm, Salnty, O2ml_L).
Input:
import pandas as pd
# Load the dataset, keep the variables of interest, and drop rows with missing values
# (the file name bottle.csv is taken from the Kaggle link above)
df = pd.read_csv('bottle.csv')
df = df[['T_degC', 'Depthm', 'Salnty', 'O2ml_L']].dropna()
# Divide the data: y is the variable to predict (T_degC) and X the features (Depthm,
# Salnty, O2ml_L)
y = df.loc[:, df.columns == 'T_degC']
X = df.loc[:, df.columns != 'T_degC']
print(y)
print(X)
Output:
T_degC Depthm Salnty O2ml_L
Before creating the linear regression model, we can check visually whether a linear relationship exists between the vari-
ables T_degC versus Depthm, T_degC versus Salnty, and T_degC versus O2ml_L. To perform this check, we can compute
some scatter diagrams with the matplotlib library by adding the lines below.
Input:
plt.show()
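# The listing above shows only the final plt.show(); the three scatter diagrams
# could be produced as in this sketch (assuming the DataFrame df loaded earlier and
# matplotlib.pyplot imported as plt)
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, col in zip(axes, ['Depthm', 'Salnty', 'O2ml_L']):
    ax.scatter(df['T_degC'], df[col], s=1)
    ax.set_xlabel('T_degC')
    ax.set_ylabel(col)
    ax.set_title(col + ' vs T_degC')
plt.show()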
Output:
[Scatter plots: Depthm versus T_degC (0 to about 5000 m), Salnty versus T_degC (about 30 to 37), and O2ml_L versus T_degC (0 to about 10 ml/L), with T_degC ranging from 0 to 30 on the horizontal axes.]
We can see that there is no linearity between T_degC and Depthm. We will drop the Depthm variable from our linear regression model and start dividing the data between the variable to predict, T_degC, and the two remaining features, Salnty and O2ml_L.
Input:
import pandas as pd
# Drop the Depthm variable, then divide the data: y is the variable to predict
# (T_degC) and X the remaining features (Salnty, O2ml_L)
df = df.drop('Depthm', axis=1)
y = df.loc[:, df.columns == 'T_degC']
X = df.loc[:, df.columns != 'T_degC']
Multiple Linear Regression with scikit-learn To apply linear regression to the data and find the intercept and coefficients, we
can use sklearn.
Input:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
# Divide the data: y is the variable to predict (T_degC) and X the features
# (Salnty, O2ml_L)
y = df.loc[:, df.columns == 'T_degC']
X = df.loc[:, df.columns != 'T_degC']
# Split the data and rescale the features (the split, the scaler, and the model
# below are assumed; the original listing omits them)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
Normalize = preprocessing.MinMaxScaler()
X_train = Normalize.fit_transform(X_train)
X_test = Normalize.fit_transform(X_test)
# Create and train the model, then print the intercept and coefficients
regr = LinearRegression()
regr.fit(X_train, y_train)
print('Intercept: \n', regr.intercept_)
print('Coefficients: \n', regr.coef_)
Output:
Intercept:
[143.98894321]
Coefficients:
[[–138.18191503 41.17953079]]
Predicted T_degC:
[[15.24926345]
[ 8.14078918]
[ 9.9864041 ]
...
[15.81746572]
[ 7.8816554]
[ 7.53595955 ]]
At this stage, we are ready to apply our linear regression model to new data, as the output shows the intercept and
coefficients:
We can use new data (X_test) and apply our model (regr) to predict the water temperature.
Input:
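# Prediction step (a sketch; the original listing is not reproduced here)
print('Predicted T_degC :')
print(regr.predict(X_test))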
Output:
Predicted T_degC :
[[15.24926345]
[ 8.14078918]
[ 9.9864041 ]
...
[15.81746572]
[ 7.8816554]
[ 7.53595955 ]]
To use the final model in the future, it is important to save the model and load it when needed.
Input:
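# Save the trained model to disk and load it back when needed
# (a sketch using joblib; the file name is an assumption)
import joblib
joblib.dump(regr, 'linear_regression_model.joblib')
regr = joblib.load('linear_regression_model.joblib')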
An important step in multiple linear regression is data scaling. In our code, we will normalize both X_train and X_test after
splitting the data.
Input:
We could normalize the data before splitting it and then create and train our model on the normalized data; this can be useful for understanding the mathematical structure of our models. In real life, however, new data are not available in advance, so they must be normalized as they arrive. The right choice depends on the size of the datasets and on whether the training and test sets are equally representative of the domain we are attempting to learn with our model. If we have many data points and the test set is
representative of our training set, then we can normalize the test dataset as shown above or use the normalization para-
meters of the training set (mean, standard deviation). Both methods would be satisfactory. For a small but representative test
dataset, it would be better to use the training parameters only, as sampling errors may negatively bias the predictions. If the
test dataset is not representative of the training set, then we need to reconsider our sampling procedure.
There is no learning rate here, as this is not learned with gradient descent. In addition, the regressors X are normalized
before regression by subtracting the mean and dividing by the l2 norm. We can use a different scaling method; for example,
we can use StandardScaler.
It is possible to fit a linear model by minimizing a regularized empirical loss with stochastic gradient descent by using SGDRegressor in scikit-learn. The default value of the initial learning rate (eta0) is 0.01.
Input:
# Importing. libraries
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import SGDRegressor
# Divide the data, y the variable to predict (T_degC) and X the features (Depthm,
Salnty, O2ml_L)
y = pd.DataFrame(df.loc[:, df.columns == 'T_degC'], columns = ["T_degC"])
X = pd.DataFrame(df.loc[:, df.columns != 'T_degC'])
# with sklearn
from sklearn import linear_model
regr = linear_model.SGDRegressor(learning_rate = 'constant', max_iter=1000, tol=1e-3)
regr.fit(X_train, y_train)
Output:
Intercept :
[10.99528094]
Coefficients :
[1.92547404 4.84608926]
Predicted T_degC :
[16.19572688 8.31397028 10.44245724 ... 16.86515669 7.3999601
7.90504924]
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import preprocessing
# Divide the data: y is the variable to predict (T_degC) and X the features
# (Salnty, O2ml_L)
y = df.loc[:, df.columns == 'T_degC']
X = df.loc[:, df.columns != 'T_degC']
# with statsmodels (X_train and y_train come from the same split and rescaling as above)
import statsmodels.api as sm
X_train = sm.add_constant(X_train) # adding a constant
model = sm.OLS(y_train, X_train).fit()
print('Statsmodels parameters:')
print(np.round(model.params, 3))
print('\n')
Output:
Statsmodels parameters:
const 143.989
x1 -138.182
x2 41.180
dtype: float64
As we can see in the results, the model does not perform very well, as we have reduced the prediction error by only 62.8% by
using regression.
Multiple Linear Regression with TensorFlow As we will see, computing a linear regression on the same data with TensorFlow
will provide the same results; however, it is also a way to learn some vector and matrix operations (multiply, transpose,
inverse, etc.) in TensorFlow. We will see that working with TensorFlow tensors is quite similar to working with np.array, with some differences.
Input:
import pandas as pd
import numpy as np
from sklearn import preprocessing
import tensorflow as tf
# Divide the data, y the variable to predict (T_degC) and X the features (Depthm,
Salnty, O2ml_L)
y = df.loc[:, df.columns == 'T_degC'].values.ravel()
X = df.loc[:, df.columns != 'T_degC']
df_results_sk = pd.DataFrame(
np.hstack([reg_mod.intercept_, reg_mod.coef_]))
df_results_sk.columns = ["estimate"]
df_results_sk.index = row_name_results
print("\n############### using sklearn ###############")
print(df_results_sk)
df_results_sm = pd.DataFrame(np.vstack([fit.params,
fit.bse, fit.params/fit.bse]).T)
df_results_sm.columns = ["estimate", "std.err", "t-stats"]
df_results_sm.index = row_name_results
mX = np.column_stack([np.ones(nrow),X_train])
beta = np.linalg.inv(mX.T.dot(mX)).dot(mX.T).dot(y_train)
err = y_train - mX.dot(beta)
s2 = err.T.dot(err)/(nrow - ncol - 1)
cov_beta = s2*np.linalg.inv(mX.T.dot(mX))
std_err = np.sqrt(np.diag(cov_beta))
df_results_np = pd.DataFrame(
np.row_stack((beta, std_err, beta/std_err)).T)
df_results_np.columns = ["estimate", "std.err", "t-stats"]
df_results_np.index = row_name_results
import tensorflow as tf
# from np.array
y_train = tf.constant(y_train, shape=[nrow, 1])
X_train = tf.constant(X_train, shape=[nrow, ncol])
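# Normal equation in TensorFlow, mirroring the NumPy computation above
# (a sketch of the remaining steps; the original listing may differ)
ones = tf.ones([nrow, 1], dtype=X_train.dtype)
mX_tf = tf.concat([ones, X_train], axis=1)
beta_tf = tf.linalg.inv(tf.transpose(mX_tf) @ mX_tf) @ tf.transpose(mX_tf) @ y_train
print(beta_tf)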
Output:
Multiple Linear Regression with Keras on TensorFlow Let us now implement a multiple linear regression with Keras on Tensor-
Flow. The approach is very similar to scikit-learn. We capture the dataset in Python using pandas DataFrame, we select the
variables to use, we drop rows having at least one missing value, we divide the data with a target variable and the features,
we split the data to produce training and test datasets, we scale the data, and we create a sequential model. The sequential
layer allows stacking of one layer on top of the other, enabling the data to flow through them. We will use a mini-batch
gradient descent optimizer and mean square loss. Finally, we will check the performance by examining the loss over time;
over time, the loss should decrease.
Let us first code a univariate linear regression with Keras on TensorFlow.
Input:
# Divide the data: y is the variable to predict (T_degC) and X the single feature Salnty
y = df.loc[:, df.columns == 'T_degC'].values.ravel()
X = df.loc[:, df.columns == 'Salnty']
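# Train/test split, model definition, and training (assumed here; the original
# listing omits these steps). The text below mentions a learning rate of 0.0001
# and 50 epochs, so those values are used; rescaling is omitted in this sketch.
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import SGD

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = Sequential()
model.add(Dense(1, input_dim=1, activation='linear'))
model.compile(optimizer=SGD(learning_rate=0.0001), loss='mse')
model.fit(X_train, y_train, epochs=50, verbose=0)

# Weight and bias of the single Dense layer, used below to draw the regression line
weights, biases = model.layers[0].get_weights()
w0, b = float(weights[0][0]), float(biases[0])
plt.scatter(X_train, y_train, color='g')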
# Testing the model and print the predictions of the inputs X_test
y_pred = model.predict(X_train)
# If we plot the linear equation in red, we will have the same line as the previous
# one (green). The red line will cover the green one.
plt.plot(X_train, w0*X_train + b, color='r')
plt.xlabel('X_train')
plt.ylabel('y_train')
plt.show()
Output:
[Scatter plot of y_train versus x_train (salinity from 30 to 37, temperature from 0 to 30) with the fitted regression line.]
[[10.906912]
[10.992794]
[10.961627]
...
[10.905527]
[10.973402]
[11.017553]]
We have chosen a learning rate of 0.0001 and 50 epochs. To plot the regression line, we can use two different methods that
produce the same result. The first method is to compute the weights and bias of our model and plot the equation:
We can also print the true values (y_test) versus predicted values (new_predictions).
Input:
# We can also print the true values (y_test) versus predicted values (new_predictions)
plt.scatter(y_test,new_predictions,color='b')
Output:
[Scatter plot: true values (y_test, 0 to 30) versus predicted values (about 10.4 to 11.4).]
For future use, we also need to save the model using the following line:
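# Save the trained Keras model for later reuse (the file name is an assumption)
model.save('univariate_keras_regression.h5')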
Let us now code a multiple (two-feature) linear regression with Keras on TensorFlow.
Input:
# Divide the data, y the variable to predict (T_degC) and X the features (Depthm,
Salnty, O2ml_L)
y = df.loc[:, df.columns == 'T_degC']
X = df.loc[:, df.columns != 'T_degC']
# We have two inputs for our model: 'Salnty' and 'O2ml_L' features
model = Sequential()
model.add(Dense(1, input_dim = 2, activation = 'linear'))
model.summary()
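# Compile and train (assumed here; the original listing omits these steps). A
# mini-batch SGD optimizer and a mean squared error loss are used, as described
# in the text; X_train and y_train come from the same split and rescaling as before.
from tensorflow.keras.optimizers import SGD
model.compile(optimizer=SGD(learning_rate=0.01), loss='mse')
history = model.fit(X_train, y_train, epochs=10, batch_size=32, verbose=0)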
# Model prediction
y_pred = model.predict(X_train)
# Printing values
print('Root mean squared error: ', rmse)
print('R2 score: ', r2)
print('\n')
print('\n')
Output:
[Plot: training loss versus epochs (0 to 8); the loss decreases from about 11 to 10.]
Weights:
[[4.5139156e-02]
[5.4653141e+01]]
If we rerun the same code, the output will be different because Keras will retrain the current model. In our example, we can
see that we have reduced the prediction error by only 62.8% by using regression. It is not easy to provide rules for what counts as a good R-squared value, as this depends on the context. Threshold values have been proposed in many fields, but there is no standard guideline. In academic research, R-squared values of 0.75, 0.5, and 0.25 can be described respectively as strong, moderate, and weak (Henseler et al. 2009).
Logistic regression is a supervised machine learning classifier used to predict categorical variables or discrete values. There
are many applications of logistic regression, such as determining the probability of having a heart attack according to weight
or exercise, filtering emails, or calculating the probability of getting accepted in a contest. We have seen that linear regres-
sion solves regression problems and predicts continuous values. Logistic regression solves classification problems by pro-
viding outputs such as 0 and 1, positive and negative, or multiple classes; multinomial logistic regression is widely used in
text classification or part-of-speech labeling. In logistic regression, we are not looking for the best-fitting line but rather
building an s-shaped curve, called a logistic function, that lies between 0 and 1.
Logistic regression extracts continuous features from the input, multiplies each by a weight, sums them, and passes the
sum through a sigmoid (logistic) function to generate a probability. For decisions, we apply a threshold to this probability: any value above the threshold is mapped to class 1 and any value below it to class 0.
The weights (vector w and bias b) are learned from a training dataset that is labeled and use a loss function, such as the cross-
entropy loss, that needs to be minimized.
In this section, we will learn the concepts of sigmoid function or logistic function, as well as the logit function, odds ratio,
and cross-entropy.
We can represent the equation above as follows (with a dot product):
$$z = w \cdot x + b$$
So far, nothing in the equation above forces z to be a probability lying between 0 and 1, as z ranges from −∞ to +∞. We can pass the sum through a sigmoid function (Figure 3.1), also called a logistic function, to generate a probability, allowing us to map real values into the range [0, 1]:
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
[Figure 3.1: the sigmoid function σ(z), rising from 0 to 1 as z goes from −10 to 10.]
Now that we have obtained a number in the range [0, 1] by applying the sigmoid function to the sum of the weighted
features, we state a probability, such as P( y = 1) + P( y = 0) = 1:
$$P(y = 1) = \sigma(z) = \sigma(w \cdot x + b) = \frac{1}{1 + e^{-(w \cdot x + b)}}$$
$$P(y = 0) = 1 - \sigma(z) = 1 - \sigma(w \cdot x + b) = 1 - \frac{1}{1 + e^{-(w \cdot x + b)}} = \frac{e^{-(w \cdot x + b)}}{1 + e^{-(w \cdot x + b)}}$$
$$P(y = 1) + P(y = 0) = 1$$
We expect our model to produce outputs based on probability scores between 0 and 1. Let us say we would like our model
to identify whether patients have a disease (target = 1) or not (target = 0) according to a certain number of features. By
defining a threshold value, for example, 0.5, we can decide in which category the patient belongs. If the prediction function
returns a value of 0.6, we would classify the patient in the "disease" category.
Recall the cost function used for linear regression:
$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \bigl(h_\theta(x_i) - y_i\bigr)^2$$
In the case of a univariate linear regression, the following has been stated:
$$h_\theta(x) = \hat{y} = \theta_0 + \theta_1 x$$
The problem with using the same cost function for logistic regression is that it will end up as a non-convex function with
several local minima. It will be challenging to minimize the cost value and find the global minimum.
[Figure: a non-convex cost function J(θ) over θ, with several local minima and the global minimum marked.]
We need a loss function that expresses how close the model output σ(w x + b) is to the correct output ( y = 0 or 1) given an
observation x. The idea is to have a loss function that tends to select the correct labels for the observed (training) data to be
more probable. This goal can be achieved by maximizing a likelihood function, and this is what is called conditional max-
imum likelihood estimation. We will choose the weights and bias (w, b) that maximize the log probability of the true y labels
in the training data given the observed data x. For logistic regression, the loss function is the negative log-likelihood loss, also
called cross-entropy loss:
$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)} \log h_\theta(x^{(i)}) + \left(1 - y^{(i)}\right)\log\left(1 - h_\theta(x^{(i)})\right)\right]$$
where $h_\theta(x^{(i)}) = \sigma(\theta^T x^{(i)})$ is the predicted probability, $(x^{(i)}, y^{(i)})$ is the ith training example, and m is the number of training examples. The cost function is composed of two parts:
$$J(h_\theta(x), y) = \begin{cases} -\log h_\theta(x), & y = 1 \\ -\log\left(1 - h_\theta(x)\right), & y = 0 \end{cases}$$
The cost function can be simplified by the notation we have used above:
$$J(\sigma(w \cdot x + b), y) = -\left[y \log \sigma(w \cdot x + b) + (1 - y)\log\left(1 - \sigma(w \cdot x + b)\right)\right]$$
For multinomial (multi-class) logistic regression, the sigmoid is replaced by the softmax function, where $z = [z_1, z_2, z_3, \ldots, z_K]$ is a vector of dimension K and $1 \le i \le K$. The softmax of z is also a vector:
$$\mathrm{softmax}(z) = \left[\frac{e^{z_1}}{\sum_{j=1}^{K} e^{z_j}}, \frac{e^{z_2}}{\sum_{j=1}^{K} e^{z_j}}, \ldots, \frac{e^{z_K}}{\sum_{j=1}^{K} e^{z_j}}\right]$$
If we have K classes, we will have K different weight vectors. We will have a matrix packing all the weight vectors together
and a vector output y. In binary logistic regression, we use a single weight vector w and a scalar output.
In multinomial logistic regression, we also use maximum likelihood estimation. The loss function generalizes from binary logistic regression with 2 classes to K classes. Both the target and the model output $\sigma(w \cdot x + b)$ can be represented as vectors of K elements, y and $\hat{y}$:
$$J(\hat{y}, y) = -\sum_{k=1}^{K} y_k \log \hat{y}_k$$
where y is a one-hot vector, meaning that all positions in the vector are 0 except the entry representing the class the observation falls into, which is 1.
Finally, we can apply gradient descent to minimize the cost.
We have seen binary logistic regression for data having two types of possible output and multinomial logistic regression for more than two categories of output. We can also explore ordinal logistic regression for data having more than two categories with a natural ordering.
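As a small numerical illustration (the scores below are arbitrary), the softmax turns a vector of scores into probabilities that sum to 1, and the cross-entropy loss compares them with a one-hot target:

import numpy as np

def softmax(z):
    # Subtracting the max improves numerical stability without changing the result
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])          # scores for K = 3 classes
y = np.array([1.0, 0.0, 0.0])          # one-hot vector: the true class is class 0
y_hat = softmax(z)

loss = -np.sum(y * np.log(y_hat))      # cross-entropy loss
print(y_hat, y_hat.sum())              # probabilities summing to 1
print("Cross-entropy loss:", loss)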
To practice and apply logistic regression, we will use the famous Fashion MNIST dataset, which is a dataset of 70,000
28 × 28 labeled Zalando’s article images (784 pixels in total per image). There is a training set of 60,000 examples that we
will use here. A test set of 10,000 examples is also available. The dataset is available at Kaggle: https://www.kaggle.com/
datasets/zalando-research/fashionmnist. Each pixel has a single value between 0 and 255, indicating the lightness of that pixel;
higher values are darker pixels. When the data are extracted, we obtain 785 columns. One column is the labels (0: T-shirt/
top, 1: trouser, 2: pullover, etc.); the rest of the columns are the 784 features, which are the pixel numbers and their
respective values.
# Import Libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Load the Fashion MNIST training set (file name assumed from the Kaggle download)
df = pd.read_csv('fashion-mnist_train.csv')
df.head()
# Divide the data, y the variable to predict (label) and X the features
X = df[df.columns[1:]]
y = df['label']
Output:
label pixel1 pixel2 pixel3 pixel4 pixel5 pixel6 pixel7 pixel8 pixel9 ... pixel775 pixel776 pixel777 pixel778 pixel779 pixel780 pixel781 pixel782 pi
0 2 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0
1 9 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0
2 6 0 0 0 0 0 0 0 5 0 ... 0 0 0 30 43 0 0 0
3 0 0 0 0 1 2 0 0 0 0 ... 3 0 0 0 0 1 0 0
4 3 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0
Input:
# Class names of the Fashion MNIST labels
Labels = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
          'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']
features = X.loc[2].values
print("Actual Label: ", Labels[y.loc[2]])
plt.imshow(features.reshape(28,28))
Output:
Actual Label: Shirt
<matplotlib.image.AxesImage at 0x12a3ccc10>
(The 28 × 28 image of sample 2, a shirt, is displayed.)
The pixel values range from 0 to 255. Therefore, dividing all values by 255 will convert them to a range from 0 to 1.
Input:
X = X/255
X.head()
Output:
pixel1 pixel2 pixel3 pixel4 pixel5 pixel6 pixel7 pixel8 pixel9 pixel10 ... pixel775 pixel776 pixel777 pixel778 pixel779 pixel780 pixel781
0 0.0 0.0 0.0 0.000000 0.000000 0.0 0.0 0.000000 0.0 0.0 ... 0.000000 0.0 0.0 0.000000 0.000000 0.000000 0.0
1 0.0 0.0 0.0 0.000000 0.000000 0.0 0.0 0.000000 0.0 0.0 ... 0.000000 0.0 0.0 0.000000 0.000000 0.000000 0.0
2 0.0 0.0 0.0 0.000000 0.000000 0.0 0.0 0.019608 0.0 0.0 ... 0.000000 0.0 0.0 0.117647 0.168627 0.000000 0.0
3 0.0 0.0 0.0 0.003922 0.007843 0.0 0.0 0.000000 0.0 0.0 ... 0.011765 0.0 0.0 0.000000 0.000000 0.003922 0.0
4 0.0 0.0 0.0 0.000000 0.000000 0.0 0.0 0.000000 0.0 0.0 ... 0.000000 0.0 0.0 0.000000 0.000000 0.000000 0.0
We will now split the data into training and testing sets, initialize the model, and train it on the training dataset. Once the
model has been trained, we can perform predictions with the test dataset and print metrics for the model (classification
accuracy, precision, recall, f1-score). We can also perform a cross-validation.
Input:
# Scikit-learn implementation
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
# Split the data into training and testing sets (split parameters assumed)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Model initialization
logistic_model = LogisticRegression(solver='sag', multi_class='auto')
# Fit the data (train the model)
logistic_model.fit(X_train, y_train)
# Model prediction on new data (X_test)
y_pred = logistic_model.predict(X_test)
print(y_pred)
Output:
[7 8 8 ... 9 5 5]
print(classification_report(y_test,y_pred))
Output:
(The classification report for the ten classes is printed.)
We can build the same multinomial logistic regression with Keras from TensorFlow: a single dense layer with a softmax activation applied to the flattened pixel features. The split and the number of epochs below are illustrative.
Input:
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten
from sklearn.model_selection import train_test_split
# Divide the data, y the variable to predict (label) and X the features
X = df[df.columns[1:]]
y = df['label']
# As the pixel values range from 0 to 255,
# dividing all the values by 255 will convert them to the range 0 to 1.
X = X/255
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
number_of_features = X.shape[1]   # 784 pixels
number_of_classes = 10            # ten clothing categories
keras_model = Sequential()
keras_model.add(Flatten(input_dim=number_of_features))
keras_model.add(Dense(number_of_classes, activation='softmax'))
keras_model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
                    metrics=['sparse_categorical_accuracy'])
keras_model.fit(X_train, y_train, epochs=10)
keras_model.evaluate(X_test, y_test)   # [loss, accuracy]
Output:
[0.5115869045257568, 0.8264166712760925]
Input:
y_keras_pred = keras_model.predict(X_test)
print(y_keras_pred)
print('\n')
print(classification_report(y_test,
(tf.argmax(keras_model.predict(X_test), axis=1)).numpy()))
Output:
We can follow the same approach for a binary classification problem, for example a diabetes dataset in which the column 'Outcome' indicates whether a patient has the disease. The split, the normalization, and the training parameters below are illustrative; the output layer is a single sigmoid unit whose probability is thresholded at 0.5.
Input:
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten
from sklearn.preprocessing import Normalizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
# df is assumed to hold the diabetes dataset loaded beforehand
# Divide the data, y the variable to predict (Outcome) and X the features
X = df[df.columns[1:]]
y = df['Outcome'].values.ravel()
# Split and normalize the data
X_train_, X_test_, y_train_, y_test_ = train_test_split(X, y, test_size=0.2, random_state=42)
X_train_ = Normalizer().fit_transform(X_train_)
X_test_ = Normalizer().fit_transform(X_test_)
number_of_features = X.shape[1]
model_diab = Sequential()
model_diab.add(Flatten(input_dim=number_of_features))
# Single sigmoid output unit for the binary target
model_diab.add(Dense(1, activation='sigmoid'))
model_diab.compile(optimizer='adam', loss='binary_crossentropy')
model_diab.fit(X_train_, y_train_, epochs=50)
predictions = model_diab.predict(X_test_)
print(predictions)
print('\n')
# Threshold the predicted probabilities at 0.5 to obtain class labels
print(classification_report(y_test_,
      (predictions > 0.5).astype(int)))
Output:
3.3 Support Vector Machine
A support vector machine (SVM) is a supervised learning algorithm that can be used for prediction of both binary variables
(classification) and quantitative variables (regression problems), although it is primarily used for classification problems.
The goal of SVM is to create a hyperplane that linearly divides n-dimensional data points into two components by searching
for an optimal margin that correctly segregates the data into different classes and at the same time is separated as much as possible from all the observations. In addition to linear classification, it is also possible to compute a nonlinear classification using what we call the kernel trick (a kernel function) that maps inputs into high-dimensional feature spaces. The kernel function, when adapted to specific problems, allows flexibility to adapt to different situations. SVM allows creation of a classifier, or a discrimination function, that we can generalize and apply for predictions such as in image classification, diagnostics, genomic sequences, or drug discovery. SVM was developed at AT&T Bell Laboratories by Vladimir Vapnik and colleagues. To select the optimal hyperplane from among many hyperplanes that might classify our data, we select the one that has the largest margin (Figure 3.2) or, in other words, that represents the largest separation between the different classes. It is an optimization problem under constraints in which the distance between the nearest data point and the optimal hyperplane (on each side) is maximized. The optimal hyperplane is then called the maximum-margin hyperplane, allowing us to create a maximum-margin classifier. The closest data points are known as support vectors, and the margin is an area that generally does not contain any data points. If the optimal hyperplane is too close to the data points and the margin too small, it will be difficult to predict new data and the model will fail to generalize well. In nonlinear cases, we need to introduce a kernel function to search for nonlinear separating surfaces. The method induces a nonlinear transformation of our dataset toward an intermediate space that we call a feature space of higher dimension.
In this chapter, we will explore linearly and not fully linearly separable binary discrimination as well as nonlinear SVMs (Figure 3.3) and SVMs for regression.
Figure 3.2 In SVM, we need to maximize the margin, which is defined by a subset of training samples, the support vectors. It is a quadratic programming problem that we can solve by standard methods.
Figure 3.3 In SVM, we can face different binary discrimination issues.
Consider a training dataset of n labeled points:
$$\{(x_i, y_i)\}, \quad i = 1, \ldots, n, \quad y_i \in \{-1, 1\}, \quad x_i \in \mathbb{R}^d$$
If d = 2, the data are linearly separable if we can draw a line separating the two classes (Figure 3.4) in a graph of two dimensions ($x_1$ versus $x_2$). If d > 2, we refer to a hyperplane on graphs ($x_1, x_2, \ldots, x_n$) that can be described by $w \cdot x + b = 0$, where w is a vector normal to the hyperplane, b is an offset, and $\frac{|b|}{\lVert w \rVert}$ is the perpendicular distance from the hyperplane to the origin.
Figure 3.4 For d = 2, the data are linearly separable if we can draw a line separating the two classes in a graph of two dimensions (x1 versus x2).
The objective is to find the variables w and b that describe our training data as follows:
$$x_i \cdot w + b \ge +1, \quad \text{for } y_i = +1$$
$$x_i \cdot w + b \le -1, \quad \text{for } y_i = -1$$
These two constraints can be combined into $y_i(x_i \cdot w + b) - 1 \ge 0\ \forall i$. Finding the maximum-margin hyperplane then amounts to minimizing $\frac{1}{2}\lVert w \rVert^2$ subject to these constraints, which we handle by introducing Lagrange multipliers $\lambda_i \ge 0$ and the Lagrangian
$$L_P = \frac{1}{2}\lVert w \rVert^2 - \sum_{i=1}^{n}\lambda_i\left[y_i(x_i \cdot w + b) - 1\right]$$
Setting the derivative with respect to b to zero gives:
$$\frac{\partial L_P}{\partial b} = \sum_{i=1}^{n}\lambda_i y_i = 0$$
Instead of minimizing over w and b subject to constraints involving the Lagrange multipliers λ, we can maximize over λ subject to the relationships obtained previously for w and b. We can eliminate the dependence on w and b by substituting for w and b back in the original equation ($\min L_P$):
$$L_D(\lambda_i) = \sum_{i=1}^{n}\lambda_i - \frac{1}{2}\sum_{i,j}\lambda_i\lambda_j y_i y_j\, x_i \cdot x_j \quad \text{such that} \quad \sum_{i=1}^{n}\lambda_i y_i = 0 \ \text{and} \ \lambda_i \ge 0$$
As we can see, the dual form requires only the dot products $x_i \cdot x_j$ between input vectors to be computed.
We can solve it using a quadratic programming solver that will output λ and allow us to calculate w based on the
following:
$$\frac{\partial L_P}{\partial w} = w - \sum_{i=1}^{n}\lambda_i y_i x_i = 0 \quad \Rightarrow \quad w = \sum_{i=1}^{n}\lambda_i y_i x_i$$
The support vectors $x_{SV}$ are the data points with $\lambda_i > 0$; they lie on the margin and satisfy:
$$y_{SV}\left(x_{SV} \cdot w + b\right) = 1$$
We can substitute the expression above in $w = \sum_{i=1}^{n}\lambda_i y_i x_i$, yielding the following:
$$b = \frac{1}{N_{SV}}\sum_{s \in S}\left(y_s - \sum_{m \in S}\lambda_m y_m\, x_m \cdot x_s\right)$$
where S is the set of support vectors and $N_{SV}$ their number.
We can then define the optimal hyperplane, as we were able to calculate w and b.
In the end, we can predict each new point x′ by evaluating y′ = sign(w · x′ + b).
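As a short sketch on synthetic, linearly separable data (the dataset and parameters are illustrative), we can train a linear SVC with scikit-learn, read off w and b, and check that sign(w · x′ + b) reproduces the classifier's predictions:

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated blobs so that the data are linearly separable
X, y = make_blobs(n_samples=100, centers=2, cluster_std=1.0, random_state=0)

model = SVC(kernel='linear', C=1.0)
model.fit(X, y)

w = model.coef_[0]        # normal vector of the separating hyperplane
b = model.intercept_[0]   # offset
print("w:", w, "b:", b)
print("Number of support vectors per class:", model.n_support_)

# Predictions of new points via sign(w . x' + b); the labels here are 0/1,
# so a positive decision value corresponds to class 1
x_new = np.array([[0.0, 0.0], [5.0, 5.0]])
manual = (np.dot(x_new, w) + b > 0).astype(int)
print(manual, model.predict(x_new))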
Figure 3.5 We need to make a trade-off between the width of the margin and the number of training errors committed by the linear decision boundary (hard margin versus soft margin). A soft margin leads to underfitting whereas a hard margin leads to overfitting.
To allow data points to be on the wrong side or within the margin area, we can introduce a slack variable (Figure 3.6):
$$x_i \cdot w + b \ge +1 - \xi_i, \quad \text{for } y_i = +1$$
$$x_i \cdot w + b \le -1 + \xi_i, \quad \text{for } y_i = -1$$
where $\xi_i \ge 0\ \forall i$, $i = 1, \ldots, n$.
We can combine the two expressions as follows:
$$y_i(x_i \cdot w + b) - 1 + \xi_i \ge 0, \quad \text{where } \xi_i \ge 0\ \forall i$$
For the fully separable case, the optimization problem is:
$$\min_{w}\ \frac{1}{2}\lVert w \rVert^2, \quad \text{such that} \quad y_i(x_i \cdot w + b) - 1 \ge 0\ \forall i$$
In a case that is not fully linearly separable, we can adapt the expression above
by introducing the slack variable and a parameter C that controls the trade-off:
$$\min_{w,\,\xi}\ \frac{1}{2}\lVert w \rVert^2 + C\sum_{i=1}^{n}\xi_i \quad \text{such that} \quad y_i(x_i \cdot w + b) - 1 + \xi_i \ge 0\ \forall i$$
Allocating Lagrange multipliers ($\lambda_i \ge 0$ and $\beta_i \ge 0$), we can write the following:
$$L_P = \frac{1}{2}\lVert w \rVert^2 + C\sum_{i=1}^{n}\xi_i - \sum_{i=1}^{n}\lambda_i\left[y_i(x_i \cdot w + b) - 1 + \xi_i\right] - \sum_{i=1}^{n}\beta_i\xi_i$$
Figure 3.6 Introduction of a slack variable to separate two nonlinearly separable classes.
From the property that the derivative at a minimum is equal to 0, we obtain the following:
$$\frac{\partial L_P}{\partial w} = w - \sum_{i=1}^{n}\lambda_i y_i x_i = 0 \quad \Rightarrow \quad w = \sum_{i=1}^{n}\lambda_i y_i x_i$$
$$\frac{\partial L_P}{\partial b} = \sum_{i=1}^{n}\lambda_i y_i = 0$$
$$\frac{\partial L_P}{\partial \xi_i} = 0 \quad \Rightarrow \quad C = \lambda_i + \beta_i$$
We now need to maximize $L_D$ and find the $\lambda_i$:
$$\max_{\lambda}\ \sum_{i=1}^{n}\lambda_i - \frac{1}{2}\lambda^T H \lambda \quad \text{such that} \quad \sum_{i=1}^{n}\lambda_i y_i = 0 \ \text{and} \ 0 \le \lambda_i \le C$$
where $H_{ij} = y_i y_j\, x_i \cdot x_j$.
Figure 3.7 The objective of nonlinear SVMs is to gain separation by mapping the data to a higher dimensional space (for example, from data that are not separable in $\mathbb{R}^2$ to data that are separable in $\mathbb{R}^3$) because many classification or regression problems are not linearly separable or regressable in the space of the inputs x. For this, we use the "kernel trick" to move to a higher dimensional feature space.
where $x_i \cdot x_j$ is the dot product of the two feature vectors. If we now transform the inputs with a mapping ϕ instead of calculating the dot product $x_i \cdot x_j$, we need to compute $\phi(x_i) \cdot \phi(x_j)$, which can be very expensive and time consuming. If we introduce a kernel function K such that $K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)$, we do not need to calculate ϕ. The kernel function allows us to work only with inner products of the mapped inputs in the feature space, without ever computing ϕ explicitly.
There are many popular kernel functions that we can use, such as the following:
• Linear: $K(x_i, x_j) = x_i^T x_j$
• Gaussian radial basis function: $K(x_i, x_j) = e^{-\frac{\lVert x_i - x_j \rVert^2}{2\sigma^2}}$
• Laplacian: $K(x_i, x_j) = e^{-\frac{\lVert x_i - x_j \rVert}{\sigma^2}}$
• Rational quadratic: $K(x_i, x_j) = 1 - \frac{\lVert x_i - x_j \rVert^2}{\lVert x_i - x_j \rVert^2 + c}$
• Multiquadratic: $K(x_i, x_j) = \sqrt{\lVert x_i - x_j \rVert^2 + c}$
• Wave: $K(x_i, x_j) = \frac{\theta}{\lVert x_i - x_j \rVert}\sin\frac{\lVert x_i - x_j \rVert}{\theta}$
Defining the proper kernel (Figure 3.8) will allow us to make the nonlinearly separable dataset in the data space x sep-
arable in the nonlinear feature space defined implicitly by the chosen nonlinear kernel function.
The new points x′ are classified as follows: y′ = sign(w · ϕ(x′) + b).
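As a quick sketch (the sample points and σ are chosen arbitrarily), we can compute the Gaussian radial basis function kernel from its definition and compare it with scikit-learn's rbf_kernel, whose gamma parameter corresponds to 1/(2σ²):

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.5]])
sigma = 1.5

# Gaussian RBF kernel from the definition K(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 sigma^2))
diff = X[:, None, :] - X[None, :, :]
K_manual = np.exp(-np.sum(diff ** 2, axis=2) / (2 * sigma ** 2))

# Same kernel through scikit-learn (gamma = 1 / (2 sigma^2))
K_sklearn = rbf_kernel(X, X, gamma=1.0 / (2 * sigma ** 2))
print(np.allclose(K_manual, K_sklearn))  # True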
For classification, we considered training data
$$\{(x_i, y_i)\}, \quad i = 1, \ldots, n, \quad y_i \in \{-1, 1\}, \quad x_i \in \mathbb{R}^d$$
We now wish to predict a real-valued output for $y_i$:
$$\{(x_i, y_i)\}, \quad i = 1, \ldots, n, \quad y_i \in \mathbb{R}, \quad x_i \in \mathbb{R}^d$$
and $\hat{y}_i = w \cdot x_i + b$.
Figure 3.8 Defining the proper kernel (linear, polynomial of second degree, radial basis, or sigmoid) will allow us to make the nonlinearly separable dataset in the data space x separable in the nonlinear feature space defined implicitly by the chosen nonlinear kernel function.
In support vector regression, the prediction error is ignored if the difference between the predicted value $\hat{y}_i$ and the actual value $y_i$ is smaller than a distance ϵ (the ϵ-insensitive loss function or ϵ-insensitive tube):
$$y_i \le \hat{y}_i + \epsilon + \xi_i^+$$
$$y_i \ge \hat{y}_i - \epsilon - \xi_i^-$$
The output variables outside the tube are given one of two slack variable penalties, where $\forall i$, $\xi_i^+ > 0$ and $\xi_i^- > 0$. They are assigned depending on whether they lie above ($\xi^+$) or below ($\xi^-$) the tube (Figure 3.9).
The main goal remains to minimize the error and to individualize the hyperplane to maximize the margin.
Figure 3.9 Regression with an ϵ-insensitive tube.
The error function that we need to minimize can be written as follows:
$$C\sum_{i=1}^{n}\left(\xi_i^+ + \xi_i^-\right) + \frac{1}{2}\lVert w \rVert^2, \quad \text{with } \xi^+ \ge 0 \text{ and } \xi^- \ge 0\ \forall i$$
As described above, we introduce Lagrange multipliers to minimize the function subject to the constraints:
$$L_P = \frac{1}{2}\lVert w \rVert^2 + C\sum_{i=1}^{n}\left(\xi_i^+ + \xi_i^-\right) - \sum_{i=1}^{n}\left(\alpha_i^+\xi_i^+ + \alpha_i^-\xi_i^-\right) - \sum_{i=1}^{n}\lambda_i^+\left(\hat{y}_i - y_i + \epsilon + \xi_i^+\right) - \sum_{i=1}^{n}\lambda_i^-\left(-\hat{y}_i + y_i + \epsilon + \xi_i^-\right)$$
Setting the derivatives with respect to the slack variables to zero gives:
$$\frac{\partial L_P}{\partial \xi_i^+} = 0 \quad \Rightarrow \quad C = \lambda_i^+ + \alpha_i^+$$
$$\frac{\partial L_P}{\partial \xi_i^-} = 0 \quad \Rightarrow \quad C = \lambda_i^- + \alpha_i^-$$
Substituting back, the dual form becomes:
$$L_D = \sum_{i=1}^{n}\left(\lambda_i^+ - \lambda_i^-\right)y_i - \epsilon\sum_{i=1}^{n}\left(\lambda_i^+ + \lambda_i^-\right) - \frac{1}{2}\sum_{i,j}\left(\lambda_i^+ - \lambda_i^-\right)\left(\lambda_j^+ - \lambda_j^-\right)x_i \cdot x_j$$
where $\forall i$, $\lambda_i^+, \lambda_i^- \ge 0$. We therefore need to solve:
$$\max_{\lambda^+, \lambda^-}\ \sum_{i=1}^{n}\left(\lambda_i^+ - \lambda_i^-\right)y_i - \epsilon\sum_{i=1}^{n}\left(\lambda_i^+ + \lambda_i^-\right) - \frac{1}{2}\sum_{i,j}\left(\lambda_i^+ - \lambda_i^-\right)\left(\lambda_j^+ - \lambda_j^-\right)x_i \cdot x_j$$
such that $0 \le \lambda_i^+, \lambda_i^- \le C$ and $\sum_i\left(\lambda_i^+ - \lambda_i^-\right) = 0$. As in the classification case, b is computed from the support vectors:
$$b = \frac{1}{N_{SV}}\sum_{s \in S}\left(y_s - \epsilon - \sum_{m \in S}\left(\lambda_m^+ - \lambda_m^-\right)x_m \cdot x_s\right)$$
The dataset we will use for classification contains the following cell types and numbers of observations:
Principal cells:
• Ganglion: 317
• Granule: 1130
• Purkinje: 497
• Pyramidal: 14002
Interneurons:
• Basket: 503
• Chandelier: 26
• Martinotti: 137
• Double bouquet: 50
• Bitufted: 67
• Nitrergic: 2044
Glial cells:
• Microglia: 7549
• Astrocyte: 1607
Let us apply SVM techniques to these data. We have been referring so far to binary classification. Natively, even though
SVM does not support multi-class classification in its simplest form, we can use it for multi-class classification by applying
the same principles after breaking down the multi-classification problem into multiple binary classification problems.
A first approach, called one-vs-one, trains a binary classifier for every pair of classes, mapping data points to a high-dimensional space to gain mutual linear separation between every pair of classes. A second approach, called one-vs-rest, fits one binary classifier for each class against all the remaining classes. Fortunately, most data science frameworks perform this breakdown automatically.
# Importing libraries
import pandas as pd
from sklearn import preprocessing
from sklearn import metrics
from sklearn import svm
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
# df is assumed to hold the neuron dataset loaded beforehand
# Divide the data, y the variable to predict (Target) and X the features
X = df[df.columns[1:]]
y = df['Target']
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Model
model=svm.SVC(kernel='linear')
model.fit(X_train,y_train)
y_pred = model.predict(X_test)
# Metrics
results = [metrics.accuracy_score(y_test, y_pred),metrics.precision_score(y_test,
y_pred, average='micro'),metrics.recall_score(y_test, y_pred, average='micro'),
metrics.f1_score(y_test, y_pred, average='micro'), cross_val_score(model, X_train,
y_train, cv=5).mean(), cross_val_score(model, X_train, y_train, cv=5).std()]
metrics_dataframe = pd.DataFrame(results, index=["Accuracy", "Precision", "Recall",
"F1 Score", "Cross-validation mean", "Cross-validation std"], columns=
{'SVM_linear'})
metrics_dataframe
Output:
SVM_linear
Accuracy 0.801471
Precision 0.801471
Recall 0.801471
F1 Score 0.801471
To change the kernel of the SVM, we simply modify the following line:
# Model
model=svm.SVC(kernel='linear')
We specify the kernel we want to apply (linear, poly, rbf, sigmoid, precomputed):
# Model
model=svm.SVC(kernel='sigmoid')
Applying different kernels produces the results below, showing the impact of each of them on the performance of
the model:
Another process that influences the results is the method we use to scale the data, as shown in the results below from the
same data:
model=svm.SVR(kernel='linear')
The above code corresponds to linear support vector regression. We can replace the kernel with the other options (poly, rbf,
sigmoid, etc.).
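As a minimal sketch on synthetic data (the dataset, kernel, and the values of C and epsilon are illustrative), support vector regression is used in the same way, with the epsilon parameter controlling the width of the ϵ-insensitive tube:

import numpy as np
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, r2_score

# Noisy sine curve as a toy regression problem
rng = np.random.RandomState(0)
X = np.sort(5 * rng.rand(200, 1), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.randn(200)

# epsilon defines the epsilon-insensitive tube; C controls the trade-off with the margin
model = SVR(kernel='rbf', C=10.0, epsilon=0.1)
model.fit(X, y)
y_pred = model.predict(X)

print("MSE:", mean_squared_error(y, y_pred))
print("R2 score:", r2_score(y, y_pred))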
3.4 Artificial Neural Networks
Artificial neural networks (ANNs) have become a central concept in the field of modern machine learning, addressing a full
range of complex problems in classification, regression, image processing, forecasting, speech recognition, NLP, and other
applications. They are inspired by the functionality of the human brain and were first introduced by McCulloch and Pitts to
model a biological neuron. The idea behind neural networks is that a network of neurons can be constructed by connecting
multiple neurons together such that the output of one neuron forms an input to another. Different types of architectures
exist for neural networks. The oldest and simplest model is the multilayer perceptron (MLP), built from the perceptron introduced by Rosenblatt. Convolutional neural networks (CNNs), which are particularly suited for image processing, were developed more recently. We can also cite another famous neural network architecture, recurrent neural networks, which can be used for sequential data that occur in time series or text. As we will see, in an ANN we have an input x and an output y = f(x, θ) where the parameters θ are estimated from a learning sample. As is usual in statistical learning, we need to minimize a function that is not convex, implying local minimizers. Cybenko (1989) and Hornik (1991) proved universal approximation theorems, whereas Le Cun (1989) described backpropagation to compute the gradient of a neural network.
Let us consider the following artificial neuron:
$$y_j = f_j(x) = \phi\left(\sum_{i=1}^{d} w_{j,i}\, x_i + b_j\right) = \phi\left(\langle w_j, x\rangle + b_j\right)$$
The artificial neuron is represented by the function $f_j$. This function has an input $x = (x_1, \ldots, x_d)$, weighted by a vector of connection weights $w_j = (w_{j,1}, \ldots, w_{j,d})$. We also have a neuron bias $b_j$. The term $\sum_{i=1}^{d} w_{j,i}\, x_i + b_j$ is called the summation. In addition, we have an activation function ϕ applied to the summation, $\phi\left(\sum_{i=1}^{d} w_{j,i}\, x_i + b_j\right)$. We can consider activation functions such as the identity function $\phi(x) = x$ or the sigmoid $\phi(x) = \frac{1}{1 + e^{-x}}$.
Figure 3.10 presents a schematic representation of an artificial neuron.
As stated above, several activation functions can be considered (Figure 3.11):
• Sigmoid function: $\phi(x) = \frac{1}{1 + e^{-\beta x}}$
• Hard threshold: $\phi_\beta(x) = 1_{x \ge \beta} = \begin{cases} 0 & \text{if } x < \beta \\ 1 & \text{if } x \ge \beta \end{cases}$
• Hyperbolic tangent (tanh): $\phi(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} = \frac{e^{2x} - 1}{e^{2x} + 1}$
• Piecewise-linear (saturating) function: $\phi(x) = 0$ if $x \le x_{min}$, increasing linearly up to 1, which is reached for $x \ge x_{max}$
• Gaussian: $\phi(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$
Figure 3.10 Schematic representation of an artificial neuron: the inputs $x_1, \ldots, x_d$ are weighted by $w_{j,1}, \ldots, w_{j,d}$ and summed with the bias, $s_j = \sum_{i=1}^{d} w_{j,i} x_i + b_j$, and the jth neuron's output is $y_j = f_j = \phi(s_j)$.
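A minimal numpy sketch of this neuron (the weights, bias, and input values below are arbitrary) applies the summation followed by an activation function:

import numpy as np

def neuron(x, w_j, b_j, phi):
    # Summation s_j = sum_i w_{j,i} x_i + b_j followed by the activation phi
    s_j = np.dot(w_j, x) + b_j
    return phi(s_j)

sigmoid = lambda s: 1.0 / (1.0 + np.exp(-s))
identity = lambda s: s

x = np.array([0.5, -1.0, 2.0])      # input x = (x_1, ..., x_d)
w_j = np.array([0.2, 0.4, -0.1])    # connection weights w_j
b_j = 0.05                          # neuron bias

print(neuron(x, w_j, b_j, sigmoid))   # output y_j with a sigmoid activation
print(neuron(x, w_j, b_j, identity))  # output with the identity activation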
Figure 3.11 Examples of activation functions, such as the identity $\phi(x) = x$ and the sigmoid $\phi(x) = \frac{1}{1 + e^{-x}}$.
(Figure: schematic of a multilayer perceptron with inputs $x_1, \ldots, x_d$, two hidden layers $h^{(1)}$ and $h^{(2)}$ of m neurons each, and an output y.)
In an MLP, there are no links between neurons inside the same layer, but each neuron of a layer is linked to all neurons of
the next layer. The output of a neuron in a hidden layer becomes the input of another neuron in the next layer. The last layer
is called the output layer. Depending on the problem we are addressing, either classification or regression, we can apply a
different activation function in the last hidden layer. For regression, no activation function is applied. In fact, we apply the
identity function, but it does nothing. To run a neural network, we will need to choose a certain number of parameters such
as the number of hidden layers, the number of neurons in each layer, the activation function, and the activation function of
the last hidden layer. In binary classification problems, each output unit implements a threshold function for which the
output value can be, for example, 0 or 1 depending on a prediction P(Y = 1|X) that has generated a value between 0 and 1 and to which we apply a threshold. For binary classification, we can use the sigmoid activation function, for instance, because its output is a value between 0 and 1. For multi-class problems, we place one neuron per class (i) in the output layer, making the sum of the predictions P(Y = i|X) equal to 1. In this case, we can use the softmax function.
Let us write out the MLP mathematically. As we will see, we will need to choose a notation and apply the same statistical
learning concepts that we have seen previously. Let us say that xi are the input units (i = 1, …, d), y is the output unit, and hkj
represents the units in the kth hidden layer.
We set $h^{(0)}(x) = x$ and establish the following:
$$h_j^{(k)}(x) = \phi\left(\sum_i w_{ji}^{(k)}\, h_i^{(k-1)}(x) + b_j^{(k)}\right) \quad \text{with } k = 1, \ldots, L \text{ hidden layers}$$
$$y_j = \psi\left(\sum_i w_{ji}^{(k)}\, h_i^{(k-1)}(x) + b_j^{(k)}\right) \quad \text{with } k = L + 1 \text{ (output layer)}$$
where ϕ is the activation function and ψ is the activation function of the output layer.
We can write the vectorized form as follows:
$$h^{(k)} = \phi\left(W^{(k)} h^{(k-1)} + b^{(k)}\right) \quad \text{with } k = 1, \ldots, L \text{ hidden layers}$$
$$y = \psi\left(W^{(k)} h^{(k-1)} + b^{(k)}\right) \quad \text{with } k = L + 1 \text{ (output layer)}$$
where W(k) is the weight matrix with the number of rows being the number of neurons in the layer k and the number of
columns being the number of neurons in the layer (k − 1), h(k) is an activation vector, and b(k) is a bias vector present in
each layer.
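A minimal sketch of this vectorized forward pass (the layer sizes, weights, and input below are random and purely illustrative) can be written with numpy:

import numpy as np

rng = np.random.RandomState(0)
sizes = [4, 5, 3, 2]   # input dimension, two hidden layers, output dimension

# One weight matrix W(k) and one bias vector b(k) per layer
W = [rng.randn(sizes[k + 1], sizes[k]) for k in range(len(sizes) - 1)]
b = [rng.randn(sizes[k + 1]) for k in range(len(sizes) - 1)]

phi = np.tanh                                   # hidden activation
psi = lambda z: np.exp(z) / np.exp(z).sum()     # softmax output activation

def forward(x):
    h = x                                       # h(0) = x
    for k in range(len(W) - 1):                 # hidden layers k = 1, ..., L
        h = phi(W[k] @ h + b[k])
    return psi(W[-1] @ h + b[-1])               # output layer k = L + 1

x = rng.randn(4)
y = forward(x)
print(y, y.sum())   # class probabilities summing to 1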
The universal approximation theorem states that feed-forward neural networks with as few as one hidden layer are uni-
versal approximators. In other words, a neural network with one hidden layer can approximate any continuous function for
inputs that do not have large gaps, meaning that they are within a specific range. One of the first versions was demonstrated
by George Cybenko (1989) for a sigmoid activation function.
To estimate the parameters, we minimize a loss function. For a binary classification problem, we use the cross-entropy loss over the m samples:
$$L = -\frac{1}{m}\sum_{j=1}^{m}\left[y_j \log \hat{y}_j + (1 - y_j)\log\left(1 - \hat{y}_j\right)\right]$$
where $y_j$ represents the expected outcome (actual value of the jth sample), $\hat{y}_j$ represents the outcome produced by our model (predicted value of the jth sample), and m is the number of samples. It can be expressed using another notation:
$$L(\theta) = -\frac{1}{m}\sum_{j=1}^{m}\left[y_j \log f(X_j, \theta) + (1 - y_j)\log\left(1 - f(X_j, \theta)\right)\right]$$
where θ is the vector of parameters to estimate, $f(X, \theta) = p_\theta(Y = 1|X)$, and $Y \in \{0, 1\}$.
For a multi-class classification problem, we consider a generalization of the previous loss function applied to k classes
(maximum likelihood estimate):
$$L = -\sum_{j=1}^{k} y_j \log \hat{y}_j$$
In the case of regression settings, we use the mean squared error. The formula of the loss is the squared difference between
the expected value and the predicted value:
$$L = \frac{1}{2}\sum_{j}\left(\hat{y}_j - y_j\right)^2$$
To minimize the loss, we use gradient descent, updating each parameter as $\theta_j = \theta_j - \epsilon\frac{\partial L}{\partial \theta_j}$, where ε is the learning rate that we need to calibrate for the algorithm to converge. We could also use Adaptive Moment Estimation (Adam) or other methodologies to minimize the loss function.
Let us derive the gradients in the case of a single sigmoid output unit, for which $\hat{y}_i = \sigma(z_i)$ with $z_i = W^T x_i + b$. The loss is:
$$L = -\frac{1}{m}\sum_{i=1}^{m}\left[y_i \log \hat{y}_i + (1 - y_i)\log\left(1 - \hat{y}_i\right)\right]$$
where m is the number of samples, $y_i$ is the actual output of the ith sample, and $\hat{y}_i$ is the predicted output of the ith sample.
We need to find the gradients of the loss function with respect to weight and bias:
$$\frac{\partial L}{\partial W_j} = \frac{\partial}{\partial W_j}\left(-\frac{1}{m}\sum_{i=1}^{m}\left[y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\right]\right) = -\frac{1}{m}\sum_{i=1}^{m}\left(\frac{y_i}{\hat{y}_i} - \frac{1 - y_i}{1 - \hat{y}_i}\right)\frac{\partial \hat{y}_i}{\partial W_j}$$
$$= \frac{1}{m}\sum_{i=1}^{m}\left(\hat{y}_i - y_i\right)x_{ij}$$
Writing this for all m samples at once:
$$\frac{\partial L}{\partial W_j} = \frac{1}{m}\left[(\hat{y}_1 - y_1)x_{1j} + \ldots + (\hat{y}_m - y_m)x_{mj}\right] = \frac{1}{m}\, x_j^{(1 \times m)}\left(\hat{y} - y\right)^{T\,(m \times 1)}$$
and stacking the n features:
$$\nabla_W L = \frac{1}{m}\begin{bmatrix} x_1^{(1 \times m)}\left(\hat{y} - y\right)^T \\ \vdots \\ x_n^{(1 \times m)}\left(\hat{y} - y\right)^T \end{bmatrix} = \frac{1}{m}\, x^{(n \times m)}\left(\hat{y} - y\right)^{T\,(m \times 1)}$$
where $x^{(n \times m)}$ is the input matrix of m samples and n features.
With the same approach, we can now find a gradient of the loss function with respect to bias:
$$\frac{\partial L}{\partial b} = \frac{1}{m}\sum_{i=1}^{m}\left(\hat{y}_i - y_i\right)\frac{\partial z_i}{\partial b} = \frac{1}{m}\sum_{i=1}^{m}\left(\hat{y}_i - y_i\right)$$
as $z_i = W^T x_i + b$ and therefore $\frac{\partial z_i}{\partial b} = 1$.
The stochastic gradient descent is defined as follows:
$$\theta_j = \theta_j - \epsilon\frac{\partial}{\partial \theta_j} L(\theta)$$
After calculation of the gradients, we can update the weights and bias with stochastic gradient descent:
$$W_j = W_j - \epsilon\frac{\partial L}{\partial W_j}$$
$$b = b - \epsilon\frac{\partial L}{\partial b}$$
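The following sketch (synthetic data, a fixed learning rate, and a fixed number of iterations chosen arbitrarily) implements these updates for a single sigmoid unit with full-batch gradient descent, using the gradients derived above:

import numpy as np

rng = np.random.RandomState(0)
m, n = 200, 2
X = rng.randn(m, n)
true_w = np.array([1.5, -2.0])
y = (X @ true_w + 0.3 > 0).astype(float)   # synthetic binary labels

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

W = np.zeros(n)
b = 0.0
eps = 0.1                                  # learning rate
for _ in range(1000):
    y_hat = sigmoid(X @ W + b)
    grad_W = X.T @ (y_hat - y) / m         # (1/m) x (y_hat - y)^T
    grad_b = np.mean(y_hat - y)            # (1/m) sum_i (y_hat_i - y_i)
    W = W - eps * grad_W                   # W_j = W_j - eps dL/dW_j
    b = b - eps * grad_b                   # b = b - eps dL/db

accuracy = np.mean((y_hat > 0.5) == y)
print("Learned weights:", W, "bias:", b, "training accuracy:", accuracy)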
For a network with hidden layers, recall the forward equations:
$$h_j^{(k)}(x) = \phi\left(\sum_i w_{ji}^{(k)}\, h_i^{(k-1)}(x) + b_j^{(k)}\right) \quad \text{with } k = 1, \ldots, L \text{ hidden layers}$$
We denote the pre-activation of neuron j in layer k by:
$$a_j^{(k)}(x) = \sum_{i=1}^{d} w_{ji}^{(k)}\, h_i^{(k-1)}(x) + b_j^{(k)}$$
For a K-class problem, the network output is the vector of class probabilities:
$$f(x) = \begin{bmatrix} P(Y = 1 \mid x) \\ \vdots \\ P(Y = K \mid x) \end{bmatrix}$$
We will also use the multidimensional function softmax as the activation function:
$$\mathrm{softmax}(x_1, \ldots, x_K) = \frac{1}{\sum_{j=1}^{K} e^{x_j}}\left[e^{x_1}, \ldots, e^{x_K}\right]$$
$$\frac{\partial\, \mathrm{softmax}(x_i)}{\partial x_j} = \begin{cases} \mathrm{softmax}(x_i)\left(1 - \mathrm{softmax}(x_i)\right), & \text{if } i = j \\ -\,\mathrm{softmax}(x_i)\,\mathrm{softmax}(x_j), & \text{if } i \ne j \end{cases}$$
Let us also introduce (f(x))j, which is the jth element of f(x) such that the following is true:
$$f(x)_y = \sum_{j=1}^{K} 1_{y=j}\, f(x)_j$$
where $f(x)_k = P(Y = k \mid x)$. If we take the logarithm of the expression above, we obtain the following for the loss function L:
$$L(f(x), y) = -\log f(x)_y = -\sum_{j=1}^{K} 1_{y=j}\log f(x)_j = -\sum_{j=1}^{K} y_j \log \hat{y}_j$$
The idea now is to compute the gradients according to weights and biases in both the output and hidden layers:
$$\frac{\partial L(f(x), y)}{\partial W_{i,j}^{(k)}}, \quad \frac{\partial L(f(x), y)}{\partial b_i^{(k)}} \quad \text{(hidden layers)}$$
$$\frac{\partial L(f(x), y)}{\partial W_{i,j}^{(L+1)}}, \quad \frac{\partial L(f(x), y)}{\partial b_i^{(L+1)}} \quad \text{(output layer)}$$
We will repeatedly use the chain rule: if z depends on variables $a_j$, $j = 1, \ldots, J$, which themselves depend on $x_i$, then
$$\frac{\partial z}{\partial x_i} = \sum_{j=1}^{J}\frac{\partial z}{\partial a_j}\frac{\partial a_j}{\partial x_i}$$
$$\frac{\partial L(f(x), y)}{\partial\left(a^{(L+1)}(x)\right)_i} = \sum_j \frac{\partial L(f(x), y)}{\partial f(x)_j}\frac{\partial f(x)_j}{\partial\left(a^{(L+1)}(x)\right)_i} = -\frac{1}{f(x)_y}\frac{\partial\, \mathrm{softmax}\left(a^{(L+1)}(x)\right)_y}{\partial\left(a^{(L+1)}(x)\right)_i}$$
$$= -\frac{1}{f(x)_y}\,\mathrm{softmax}\left(a^{(L+1)}(x)\right)_y\left(1 - \mathrm{softmax}\left(a^{(L+1)}(x)\right)_y\right)1_{y=i} + \frac{1}{f(x)_y}\,\mathrm{softmax}\left(a^{(L+1)}(x)\right)_y\,\mathrm{softmax}\left(a^{(L+1)}(x)\right)_i\,1_{y \ne i}$$
$$= \left(-1 + f(x)_y\right)1_{y=i} + f(x)_i\, 1_{y \ne i}$$
so that, in vector form,
$$\nabla_{a^{(L+1)}(x)} L(f(x), y) = f(x) - e(y)$$
where $y \in \{1, \ldots, K\}$ and e(y) is the vector of $\mathbb{R}^K$ with ith component $1_{i=y}$.
We can now compute the gradients of the loss function according to weights and biases in the output layers.
For the output bias:
$$\nabla_{b^{(L+1)}} L(f(x), y) = f(x) - e(y) \quad \text{as} \quad \frac{\partial\left(a^{(L+1)}(x)\right)_j}{\partial b_i^{(L+1)}} = 1_{i=j}$$
For the output weights:
$$\frac{\partial L(f(x), y)}{\partial W_{i,j}^{(L+1)}} = \sum_k \frac{\partial L(f(x), y)}{\partial\left(a^{(L+1)}(x)\right)_k}\frac{\partial\left(a^{(L+1)}(x)\right)_k}{\partial W_{i,j}^{(L+1)}}, \quad \text{with} \quad \frac{\partial\left(a^{(L+1)}(x)\right)_k}{\partial W_{i,j}^{(L+1)}} = h_j^{(L)}(x)\, 1_{i=k}$$
so that
$$\nabla_{W^{(L+1)}} L(f(x), y) = \left(f(x) - e(y)\right) h^{(L)}(x)^T$$
We can also compute the gradients of the loss function according to weights and biases in the hidden layers.
As usual, we use the chain rule as follows:
$$\frac{\partial L(f(x), y)}{\partial h^{(k)}(x)_j} = \sum_i \frac{\partial L(f(x), y)}{\partial\left(a^{(k+1)}(x)\right)_i}\frac{\partial\left(a^{(k+1)}(x)\right)_i}{\partial h^{(k)}(x)_j} = \sum_i \frac{\partial L(f(x), y)}{\partial\left(a^{(k+1)}(x)\right)_i}\, W_{ij}^{(k+1)}$$
$$\nabla_{h^{(k)}(x)} L(f(x), y) = W^{(k+1)T}\, \nabla_{a^{(k+1)}(x)} L(f(x), y)$$
where
$$\nabla_{a^{(k)}(x)} L(f(x), y) = \nabla_{h^{(k)}(x)} L(f(x), y) \odot \left[\phi'\left(a^{(k)}(x)_1\right), \ldots, \phi'\left(a^{(k)}(x)_j\right), \ldots\right]$$
The gradient of the loss function with respect to the hidden weights is:
$$\frac{\partial L(f(x), y)}{\partial W_{i,j}^{(k)}} = \frac{\partial L(f(x), y)}{\partial\left(a^{(k)}(x)\right)_i}\frac{\partial\left(a^{(k)}(x)\right)_i}{\partial W_{i,j}^{(k)}} = \frac{\partial L(f(x), y)}{\partial\left(a^{(k)}(x)\right)_i}\, h_j^{(k-1)}(x)$$
$$\nabla_{W^{(k)}} L(f(x), y) = \nabla_{a^{(k)}(x)} L(f(x), y)\; h^{(k-1)}(x)^T$$
We can now calculate the gradient with respect to the hidden biases:
$$\frac{\partial L(f(x), y)}{\partial b_i^{(k)}} = \frac{\partial L(f(x), y)}{\partial\left(a^{(k)}(x)\right)_i}$$
$$\nabla_{b^{(k)}} L(f(x), y) = \nabla_{a^{(k)}(x)} L(f(x), y)$$
Thus, backpropagation provides a way to compute gradients that will serve for the stochastic gradient descent, which
updates the parameters of a model to minimize a loss function by using gradients of the loss function with respect to
the parameters. Backpropagation avoids repeating calculations by computing gradients one layer at a time.
Here, we have seen the case of a multi-class classification problem. We can also calculate the gradients of the loss function with respect to weights and biases in both the output and hidden layers for binary classification and regression problems by following the same procedure and taking the corresponding loss functions.
Convolutional neural networks rely on the convolution operation. For one-dimensional discrete signals, the convolution of f and g can be written as:
$$(f * g)(x) = \sum_t f(t)\, g(x + t)$$
For 2D signals, for which we apply a kernel K to a 2D signal I, we have the following:
$$(K * I)(i, j) = \sum_{m,n} K(m, n)\, I(i + n, j + m)$$
A kernel convolution, used in computer vision algorithms and also in CNNs, is the process in which we pass a small matrix of numbers (the kernel or filter) over our image and transform it based on the values of the filter. The input image can be denoted X and our filter f, giving the expression X ∗ f.
To better understand the process, let us consider an image of size 3 × 3 and a filter of size 2 × 2 (Figure 3.15). The filter is passed over our main image and performs an element-wise multiplication such as the following (Figure 3.16):
10 × 1 + 5 × 0 + 8 × 1 + 2 × 0 = 18
5 × 1 + 6 × 0 + 2 × 1 + 14 × 0 = 7
8 × 1 + 2 × 0 + 2 × 1 + 3 × 0 = 10
2 × 1 + 14 × 0 + 3 × 1 + 4 × 0 = 5
For an image of dimensions n × n and a filter of dimensions f × f, the output will be of dimensions (n − f + 1) × (n − f + 1).
Figure 3.13 Schematic representation of an input image (3 × 3 pixels) using the RGB model in which we have three matrices (red, green, blue) storing values from 0 to 255.
Figure 3.14 A digital image, such as this picture of a handwritten “7,” can be considered as a matrix of numbers in which each
number corresponds to the brightness of a pixel.
After seeing the convolution layers, let us approach fully connected layers. As explained above, the convolution layer extracts features from the original data and generates a two-dimensional matrix (Figure 3.14). We send these features to a fully connected layer, a traditional neural network, that generates the final output. The fully connected layer can only work with one-dimensional data; therefore, we need to convert our two-dimensional matrix into a one-dimensional format (Figure 3.16). The fully connected layer will perform two operations on the input data that we have described:
• A linear transformation: $Z = W^T X + b$
• A nonlinear transformation with the activation function
Figure 3.15 An image of size 3 × 3 and a filter of size 2 × 2.
Figure 3.16 The filter is passed over our main image and performs an element-wise multiplication, giving the 2 × 2 feature map [[18, 7], [10, 5]], which is then converted into a one-dimensional vector (X1 = 18, X2 = 7, X3 = 10, X4 = 5) before the fully connected layer.
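A short numpy sketch reproduces this element-wise multiplication of the 2 × 2 filter sliding over the 3 × 3 image and recovers the feature map [[18, 7], [10, 5]] computed above:

import numpy as np

image = np.array([[10, 5, 6],
                  [8, 2, 14],
                  [2, 3, 4]])
kernel = np.array([[1, 0],
                   [1, 0]])

n, f = image.shape[0], kernel.shape[0]
out = np.zeros((n - f + 1, n - f + 1), dtype=int)
for i in range(n - f + 1):
    for j in range(n - f + 1):
        # Element-wise multiplication of the filter with the current patch
        out[i, j] = np.sum(image[i:i + f, j:j + f] * kernel)

print(out)             # [[18  7] [10  5]]
print(out.flatten())   # conversion into 1D before the fully connected layer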
Recurrent neural networks (RNNs) process sequential data by reusing the same cell at every timestamp and passing information from one timestamp to the next.
Figure 3.17 An RNN with certain inputs at (t − 1) that will lead to outputs at time (t − 1). At the next timestamp, the information at (t − 1) is provided along with the input at time t, eventually providing an output at time t as well. This process is repeated through all the timestamps in the model.
Let us consider the weight matrix w and the bias b. At time $t_0$ and input $x_0$, we need to find $h_0$ such that the following is true:
$$h_t = \phi\left(w_i x_t + w_R h_{t-1} + b_n\right)$$
Then, we need to calculate $y_0$ according to the following formula:
$$y_t = \psi\left(w_y h_t + b_y\right)$$
This process is repeated through all the timestamps (Figure 3.18).
Figure 3.18 An RNN with weights w and bias b.
RNNs use backpropagation to train, but it is applied for every timestamp.
This process is commonly called backpropagation through time (BTT).
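A minimal sketch of the forward recurrence (the weights, bias, and the three-timestamp input sequence are arbitrary) shows how the hidden state is carried from one timestamp to the next:

import numpy as np

x_seq = [np.array([0.5]), np.array([-0.2]), np.array([0.9])]   # inputs x_0, x_1, x_2

w_i, w_R, b_n = 0.8, 0.5, 0.1      # input weight, recurrent weight, bias
w_y, b_y = 1.2, 0.0                # output weight and bias

h = np.zeros(1)                    # initial hidden state
for t, x_t in enumerate(x_seq):
    h = np.tanh(w_i * x_t + w_R * h + b_n)   # h_t = phi(w_i x_t + w_R h_{t-1} + b_n)
    y_t = w_y * h + b_y                      # y_t = psi(w_y h_t + b_y), psi = identity here
    print(f"t={t}: h_t={h}, y_t={y_t}")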
All these applications that we have seen as examples above are due to the performance of a particular RNN called long
short-term memory (LSTM), which can learn long-term dependencies. LSTM cells were introduced by Hochreiter
and Schmidhuber (1997). An LSTM cell includes, at time t, a state $C_t$ and an output $h_t$. This cell receives inputs from $x_t$, $C_{t-1}$, and $h_{t-1}$. Inside the LSTM, the computations are defined by gates that either allow the transmission of information or do not. The computations are performed by equations described by Hochreiter and Schmidhuber.
To implement an MLP with scikit-learn, we can use MLPClassifier, whose main hyperparameters include the following:
• activation: The activation function for the hidden layers ("identity," "logistic," "tanh," "relu"; default = "relu").
• solver: The solver for weight optimization ("lbfgs," "sgd," "adam"; default = "adam").
• learning_rate_init: The initial learning rate used (for sgd or adam). It controls the step size in updating the weights.
The application below provides us with a dataframe for the accuracy score (the ratio of the number of correct predictions to all predictions made by the classifier), the precision score (of the cases predicted as positive, how many are actually positive), the recall score (how many of the actual positive cases we were able to predict correctly), the f1-score (the harmonic mean of precision and recall), and the cross-validation score (mean and standard deviation).
Input:
# Importing libraries
import pandas as pd
from sklearn import preprocessing
from sklearn import metrics
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report
# df is assumed to hold the neuron dataset loaded beforehand
# Divide the data, y the variable to predict (Target) and X the features
X = df[df.columns[1:]]
y = df['Target']
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Model (the hyperparameter values below are illustrative)
model = MLPClassifier(hidden_layer_sizes=(100,), activation='relu', solver='adam', max_iter=1000)
model.fit(X_train,y_train)
y_pred = model.predict(X_test)
# Metrics
results = [metrics.accuracy_score(y_test, y_pred), metrics.precision_score(y_test, y_pred, average='micro'),
           metrics.recall_score(y_test, y_pred, average='micro'), metrics.f1_score(y_test, y_pred, average='micro')]
metrics_dataframe = pd.DataFrame(results, index=["Accuracy", "Precision", "Recall", "F1 Score"], columns=['MLP_neural_network'])
metrics_dataframe
Output:
MLP_neural_network
Accuracy 0.773852
Precision 0.773852
Recall 0.773852
F1 Score 0.773852
To define the best hyperparameters automatically, we can use GridSearchCV (from sklearn.model_selection import
GridSearchCV).
Input:
# Importing libraries
import pandas as pd
from sklearn import preprocessing
from sklearn import metrics
from sklearn.model_selection import cross_val_score, train_test_split, GridSearchCV
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report
df = neuron.head(22).copy()
df = pd.concat([df, neuron.iloc[17033:17053]])
# Divide the data, y the variable to predict (Target) and X the features
X = df[df.columns[1:]]
y = df['Target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Hyperparameter grid (the values below are illustrative)
param_grid = {'hidden_layer_sizes': [(50,), (100,)],
              'activation': ['relu', 'tanh'],
              'solver': ['adam', 'lbfgs']}
model = GridSearchCV(MLPClassifier(max_iter=1000), param_grid, cv=5)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
results = [metrics.accuracy_score(y_test, y_pred), metrics.precision_score(y_test, y_pred, average='micro'),
           metrics.recall_score(y_test, y_pred, average='micro'), metrics.f1_score(y_test, y_pred, average='micro')]
metrics_dataframe = pd.DataFrame(results, index=["Accuracy", "Precision", "Recall", "F1 Score"],
                                 columns=['mlp_neural_network_auto'])
metrics_dataframe
Output:
accuracy 0.89 9
macro avg 0.90 0.90 0.89 9
weighted avg 0.91 0.89 0.89 9
mlp_neural_network_auto
Accuracy 0.888889
Precision 0.888889
Recall 0.888889
F1 Score 0.888889
The Keras Python library for deep learning focuses on the creation of models as a sequence of layers. In the example
below, we will code a simple MLP neural network using Keras from TensorFlow by defining a sequential model and
specifying all the layers.
As stated previously, we will need the following inputs:
• The first layer of our model, which specifies the shape of the input (input_dim).
• The weight initialization: "uniform," in which weights are initialized to small uniformly random values between 0 and 0.05; "normal," in which weights are initialized to small Gaussian random values; or "zero," in which weights are set to zero values.
Let us code an example with neurons classified as principal cells or interneurons (a binary classification problem).
Input:
# Importing libraries
import pandas as pd
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten
from sklearn import metrics
from sklearn.model_selection import train_test_split
# df is assumed to hold the neuron dataset with a binary 'Target' column
# Divide the data, y the variable to predict (Target) and X the features
X = df[df.columns[1:]]
y = df['Target']
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Parameters of the model (the values below are illustrative)
number_of_features = X.shape[1]
number_of_classes = 2
gpu_mlp_activation = 'sigmoid'
gpu_mlp_optimizer = 'adam'
gpu_mlp_epochs = 50
gpu_mlp_loss = 'sparse_categorical_accuracy'
# For future use, which devices the operations and tensors are assigned to (GPU, CPU)
#tf.debugging.set_log_device_placement(True)
# Model
keras_model = Sequential()
keras_model.add(Flatten(input_dim=number_of_features))
keras_model.add(Dense(number_of_classes, activation=gpu_mlp_activation))
# sparse_categorical_crossentropy is used because the labels are integer class indices
keras_model.compile(optimizer = gpu_mlp_optimizer,
    loss = 'sparse_categorical_crossentropy',
    metrics = [gpu_mlp_loss])
keras_model.fit(X_train, y_train, epochs=gpu_mlp_epochs)
keras_model.evaluate(X_test, y_test) # loss, sparse_categorical_accuracy
# Compute and print predicted output with X_test as new input data
print('Print predicted output with X_test as new input data \n')
y_keras_test = (tf.argmax(keras_model.predict(X_test), axis=1)).numpy()
print('\n')
print('Predictions: \n', y_keras_test)
print('\n')
print('Real values: \n', y_test)
print('\n')
results = [metrics.accuracy_score(y_test, y_keras_test), metrics.precision_score(y_test, y_keras_test, average='micro'),
           metrics.recall_score(y_test, y_keras_test, average='micro'), metrics.f1_score(y_test, y_keras_test, average='micro')]
metrics_dataframe = pd.DataFrame(results, index=["Accuracy", "Precision", "Recall", "F1 Score"], columns=['gpu_mlp'])
metrics_dataframe
Output:
Predictions:
[0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0 1 0 0 0 1 0 1 1 1 1
1 0 1 0 1 0 0 1 1 1 0 1 1 1 1 1 0 1 0 1 1 1 1 1 0 1 1 0 1 0 0 1 0 1 1 1 0
1 1 1 1 0 0 0 0 1 1 1 1 1 1 0 1 1 0 1 1 1 0 1 1 0 1 1 1 0 0 0 0 1 1 1 0 1
0 0 0 1 1 0 1 0 0]
Real values:
110 1
287 1
17298 0
77 1
181 1
..
17131 0
17166 0
148 1
17227 0
17171 0
Name: Target, Length: 120, dtype: int64
gpu_mlp
Accuracy 0.758333
Precision 0.758333
Recall 0.758333
F1 Score 0.758333
If we wish to use an MLP neural network for regression data, the code is similar. We define a function to create the baseline model. The difference is that we do not use an activation function for the output layer.
Below is an example of a function we could create for a regression problem.
Input:
import numpy as np
import pandas as pd
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from sklearn.metrics import mean_squared_error, r2_score

# The function name below is illustrative
def mlp_regression_gpu(X, y, X_train, y_train, X_test, y_test, cv,
                       gpu_mlp_epochs, gpu_mlp_activation_r):
    """
    Multi-Layer perceptron using GPU for regression
    Inputs:
        X, y: non-split dataset separated by features (X) and labels (y). This is used
            for cross-validation
        X_train, y_train: selected dataset to train the model separated by features
            (X_train) and labels (y_train)
        X_test, y_test: selected dataset to test the model separated by features
            (X_test) and labels (y_test)
        cv: number of k-folds for cross-validation
        gpu_mlp_epochs: The number of epochs (integer)
        gpu_mlp_activation_r: The activation function such as softmax, sigmoid, linear
            or tanh.
    Output:
        A DataFrame with the following metrics:
        - Root mean squared error (RMSE)
        - R2 score
    """
    number_of_features = X.shape[1]
    # Model creation
    keras_model = Sequential()
    keras_model.add(Dense(number_of_features, input_shape=(number_of_features,),
        kernel_initializer='normal', activation=gpu_mlp_activation_r))
    keras_model.add(Dense(1, kernel_initializer='normal'))
    keras_model.compile(loss='mean_squared_error', optimizer='adam')
    # Model training
    keras_model.fit(X_train, y_train, epochs=gpu_mlp_epochs)
    # Model prediction
    y_pred = keras_model.predict(X_test)
    # Compute and print predicted output with X_test as new input data
    print("\n")
    print('Print predicted output with X_test as new input data \n')
    print('\n')
    print('Predictions: \n', y_pred)
    print('\n')
    print('Real values: \n', y_test)
    print('\n')
    # Printing metrics
    mse = np.sqrt(mean_squared_error(y_test, y_pred))
    r2 = r2_score(y_test, y_pred)
    weights, bias = keras_model.layers[-1].get_weights()
    print("MLP for Regression Metrics on GPU \n")
    print('Root mean squared error: ', mse)
    print('R2 score: ', r2)
    print("Intercept:", bias)
    print("Weights:", weights)
    print('\n')
    metrics_dataframe = pd.DataFrame([mse, r2], index=['RMSE', 'R2 score'],
                                     columns=['mlp_regression_gpu'])
    return metrics_dataframe
Input:
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
# Load data
DailyDelhiClimateTrain = '../data/datasets/DailyDelhiClimateTrain.csv'
df = pd.read_csv(DailyDelhiClimateTrain, delimiter=',')
# The 'Target' column (the variable to predict) is assumed to have been prepared beforehand
# Divide the data, y the variable to predict (Target) and X the features
X = df[df.columns[1:]]
y = df['Target']
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Reshape the features to the 3D input expected by the LSTM layer:
# (samples, timesteps, features), here with a single timestep
X_train_r = np.array(X_train).reshape((X_train.shape[0], 1, X_train.shape[1]))
X_test_r = np.array(X_test).reshape((X_test.shape[0], 1, X_test.shape[1]))
# Model (the number of units and of epochs below is illustrative)
model_lstm = Sequential()
model_lstm.add(LSTM(64, input_shape=(1, X_train.shape[1])))
model_lstm.add(Dense(1))
model_lstm.compile(loss='mean_squared_error', optimizer='adam')
model_lstm.fit(X_train_r, y_train, epochs=50, batch_size=32)
# Model prediction
y_pred = model_lstm.predict(X_test_r)
# Compute and print predicted output with X_test as new input data
print("\n")
print('Print predicted output with X_test as new input data \n')
print('\n')
print('Predictions: \n', y_pred)
print('\n')
print('Real values: \n', y_test)
print('\n')
# Printing metrics
mse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
weights, bias = model_lstm.layers[-1].get_weights()
print("RNN on GPU \n")
print('Root mean squared error: ', mse)
print('R2 score: ', r2)
print("Intercept:", bias)
print("Weights:", weights)
print('\n')
metrics_dataframe = pd.DataFrame([mse, r2], index=['MSE', 'R-squared'], columns=['RNN'])
metrics_dataframe
Output:
Epoch 1/50
37/37 [==============================] - 3s 27ms/step - loss: 133.7000
Epoch 2/50
37/37 [==============================] - 1s 30ms/step - loss: 46.5097
Epoch 3/50
37/37 [==============================] - 1s 30ms/step - loss: 45.7743
Epoch 4/50
37/37 [==============================] - 1s 30ms/step - loss: 46.2194
Epoch 5/50
37/37 [==============================] - 1s 33ms/step - loss: 44.2782
Epoch 49/50
37/37 [==============================] - 2s 62ms/step - loss: 4.0243
Epoch 50/50
37/37 [==============================] - 2s 60ms/step - loss: 4.9160
Predictions:
[[35.005676]
[14.977965]
...
[13.91688]
[31.52404]]
Real values:
892 35.875000
1106 18.000000
413 15.250000
522 38.500000
1036 24.000000
...
1362 31.240000
802 21.500000
651 24.500000
722 9.875000
254 31.166667
RNN on GPU
RNN
MSE 3.426274
R-squared 0.936434
Let us now apply a convolutional neural network to the MNIST dataset of handwritten digits, which can be loaded directly from Keras:
Input:
from keras.datasets import mnist
import matplotlib.pyplot as plt
# Download mnist data and split into train and test sets
(X_train, y_train), (X_test, y_test) = mnist.load_data()
# Display one of the training images (index chosen for illustration)
plt.imshow(X_train[0])
Output:
(A 28 × 28 image of a handwritten digit from the training set is displayed.)
To create our convolutional neural network model, we will need to perform some data preprocessing, such as reshaping the
data, and perform one hot encoding. We will then build our model, which means setting the parameters and hyperpara-
meters such as activation functions, the optimizer, and the loss function, training the model, and finally using the model to
predict new data.
In our model, we will reshape the data to fit the model with 60,000 images for training and 10,000 for testing. The images
have a size of 28 × 28, and we will set the image as greyscale. We will use one hot encoding for the target column (y_train and
y_test), which means that we will create a column for each output category. For example, for an image with the number “3,”
we will have [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]. We will then create our model using Sequential() from Keras to build the model layer by
layer. We will create two convolutional layers (Conv2D), the first one with 60 nodes and the second one with 30 nodes. These
numbers can be adjusted. Depending on the size of the dataset, the numbers can be higher or lower. We will also use relu as
the activation function for our first two layers and softmax for the last one. The size of the filter matrix for the convolution will
be set to 3 (3 × 3). A flattened layer is added between the convolutional and dense layers to connect both of them. The dense
layer is our output layer. The output will be a series of arrays with probabilities because we have chosen the softmax activation
function for the last layer. Summing the arrays will yield 1. Taking the highest probability will give us the predicted number.
Input:
import tensorflow as tf
from keras.datasets import mnist
import matplotlib.pyplot as plt
from tensorflow.keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Dense, Conv2D, Flatten
# Download mnist data and split into train and test sets
(X_train, y_train), (X_test, y_test) = mnist.load_data()
# Here we reshape the data to fit the model with 60000 images for training, image size is 28x28
# 1 means that the image is greyscale (one channel).
# If we want to use RGB values (a color image), 3 can be used.
X_train = X_train.reshape(60000,28,28,1)
X_test = X_test.reshape(10000,28,28,1)
# One hot encode the target columns
y_train = to_categorical(y_train)
y_test = to_categorical(y_test)
print(y_train[8])
# Create the model layer by layer
model = Sequential()
# We set the kernel_size parameter to 3 which means that the size of the filter matrix
# for the convolution is 3x3.
# The input_shape is simply the size of our images and 1 means that the image is greyscale
model.add(Conv2D(60, kernel_size=3, activation='relu', input_shape=(28,28,1)))
model.add(Conv2D(30, kernel_size=3, activation='relu'))
# Here we add a Flatten layer between the Convolutional layers and the Dense layer in order to connect both of them.
model.add(Flatten())
# The Dense layer is our output layer (standard) with the softmax activation function in order to make the output sum up to 1.
# It means that we will have "probabilities" to predict our images
model.add(Dense(10, activation='softmax'))
# Compile model
# We use the adam optimizer and the categorical cross-entropy loss function.
# We use accuracy to measure model performance
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
# Train the model, using the test set for validation
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=5)
# Predict the first five images of the test set and compare with the actual values
print('Predicted values:')
print(model.predict(X_test[:5]))
print('Actual values:')
print(y_test[:5])
Output:
[0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
Epoch 1/5
1875/1875 [==============================] - 75s 40ms/step - loss: 0.2565 - accuracy:
0.9518 - val_loss: 0.1033 - val_accuracy: 0.9706
Epoch 2/5
1875/1875 [==============================] - 74s 40ms/step - loss: 0.0680 - accuracy:
0.9797 - val_loss: 0.0817 - val_accuracy: 0.9738
Epoch 3/5
1875/1875 [==============================] - 75s 40ms/step - loss: 0.0456 - accuracy:
0.9862 - val_loss: 0.0777 - val_accuracy: 0.9789
Epoch 4/5
1875/1875 [==============================] - 78s 41ms/step - loss: 0.0336 - accuracy:
0.9895 - val_loss: 0.0932 - val_accuracy: 0.9767
Epoch 5/5
1875/1875 [==============================] - 73s 39ms/step - loss: 0.0275 - accuracy:
0.9915 - val_loss: 0.0945 - val_accuracy: 0.9760
Predicted values:
[[3.0388667e-11 4.6775282e-16 2.5344246e-10 6.1128755e-09 6.3794181e-15
4.2486505e-12 9.2009594e-21 1.0000000e+00 1.9324492e-12 6.5055183e-10]
[1.6037160e-09 7.9608098e-09 9.9999225e-01 1.2304556e-10 3.5661774e-14
4.2955827e-13 7.7231443e-06 1.9912785e-13 3.4886691e-09 1.8349003e-16]
[8.1152383e-09 9.9996758e-01 4.8646775e-08 1.1176332e-12 2.9895480e-06
3.7775965e-08 9.3667410e-08 2.0743075e-08 2.9326071e-05 1.7420209e-08]
[1.0000000e+00 5.1645927e-12 2.5680147e-10 7.0902344e-13 5.4258855e-15
2.6562469e-10 3.5401580e-08 1.8963197e-12 3.2704232e-11 1.0075266e-10]
[9.7294602e-14 1.1301348e-12 6.7401423e-15 1.9974581e-15 1.0000000e+00
2.0294430e-12 2.5000656e-15 1.6443602e-10 5.6247194e-11 3.1983134e-09]]
Actual values:
[[0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
[0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
[0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]]
As we can see, when we used the model to predict the first five images from the test dataset, the model predicted the
following numbers (taking the highest probability): 7, 2, 1, 0, and 4. The actual values were also 7, 2, 1, 0, and 4.
3.5 Many More Algorithms to Explore
Now that you have explored some machine learning algorithms and how to code them, you are ready to use many more! In the realm of supervised learning algorithms, we can consider the following (we have seen some of them):
• Naïve Bayes, with the Gaussian naïve Bayes algorithm (GNB) and multinomial naïve Bayes (MNB).
• K-nearest neighbors.
• Exhaustive CHAID.
• C5.0.
For example, in scikit-learn, most of them are available and well documented.
Throughout our journey, we have utilized various unsupervised learning algorithms like PCA, ICA, Isomap, LLE, and
t-SNE, primarily for dimensionality reduction. Other clustering techniques such as k-means, affinity propagation, mean
shift, and DBSCAN can also be examined. We will discuss some of these algorithms in Section 3.6.
Depending on the specific application, readers might be interested in exploring certain algorithms in more depth, either
through user training or by utilizing pre-trained models:
• Generative Adversarial Networks (GANs) and their derivatives for generating new data such as images and videos.
• BERT (Bidirectional Encoder Representations from Transformers), Transformer, or Generative Pre-training (GPT) models for natural language processing (NLP) tasks, including language comprehension and unsupervised language representation.
As evident, there is an abundance of topics to explore. Machine learning continues to evolve as a highly active and inno-
vative domain.
3.6 Unsupervised Machine Learning Algorithms
Unsupervised learning is a type of machine learning approach that is utilized when the objective is to discover and extract
meaningful patterns or structure from a given dataset without prior knowledge of the target variable or any labeling. This
technique is particularly helpful in scenarios where the available data is large, complex, or diverse, and there is a need to
gain insights and understanding of the data. Unsupervised learning algorithms can identify similarities, groupings, and
outliers within the data, which can be used to form clusters, reduce dimensionality, or create visualizations that aid in data
exploration.
One common application of unsupervised learning is in the field of data mining, where the objective is to uncover pre-
viously unknown relationships or associations between variables. Another use case is in anomaly detection, where the algo-
rithm can identify unusual patterns or outliers that may indicate errors, fraud, or unusual behavior. Unsupervised learning
algorithms are also used in NLP to identify topics and themes in large textual datasets.
By reducing the size of data, unsupervised learning algorithms can significantly reduce computational requirements and
enhance the efficiency of downstream analysis. This can be particularly beneficial in applications such as image and speech
recognition, where the amount of data can be vast, and the processing power required to analyze the data can be prohibitive.
Most of the techniques for data reduction were developed in the feature extraction section (Chapter 2, Section 2.5). We
explored techniques such as principal component analysis, independent component analysis, locally linear embedding,
t-distributed stochastic neighbor embedding, and manifold learning techniques.
In summary, unsupervised learning is a versatile technique that can aid in data exploration, pattern discovery, clustering,
and dimensionality reduction, and can be used in various domains such as data mining, anomaly detection, and NLP.
Its ability to reduce the size of data can enhance the efficiency of downstream analysis and enable the processing of
large-scale datasets.
3.6.1 Clustering
Clustering aims to group data points into subsets, or clusters, based on similarities in their features or attributes. There are
two main types of clustering: partitional clustering and hierarchical clustering. Partitional clustering involves dividing the
data into a fixed number of clusters, while hierarchical clustering builds a hierarchy of nested clusters by iteratively group-
ing similar clusters into larger ones. Both methods aim to maximize the similarity within clusters while minimizing the
similarity between clusters. The resulting clusters can reveal insights into the underlying structure of the data and can
be used for tasks such as anomaly detection, pattern recognition, and data compression.
Clustering algorithms can be evaluated based on their ability to produce meaningful and useful clusters, as well as their
computational efficiency and scalability to large datasets. Popular clustering algorithms include k-means, hierarchical clus-
tering, and density-based clustering. The choice of algorithm and clustering approach will depend on the specific charac-
teristics of the data and the objectives of the analysis.
One example of a clustering algorithm that can be used for unsupervised learning is the k-means algorithm. In k-means
clustering, the objective is to partition a given dataset into K non-overlapping clusters, where K is a predetermined value.
Each data point is assigned to only one of the K clusters based on its similarity to the centroid, or center point, of the cluster.
The centroids are iteratively updated until the cluster assignments stabilize, resulting in a final set of clusters. Unlike hier-
archical clustering, k-means clustering does not create a hierarchical structure of nested clusters, and each data point is
assigned to only one cluster. This makes k-means more suitable for datasets with a large number of data points and where
non-overlapping clusters are desired. However, the performance of k-means clustering can be sensitive to the initial place-
ment of the centroids, and it may not work well for datasets with irregular shapes or non-convex clusters. In summary,
the k-means algorithm is a clustering algorithm used for unsupervised learning, which partitions a given dataset into
K non-overlapping clusters. Each data point is assigned to only one cluster based on its similarity to the centroid of the
cluster, and the centroids are updated iteratively until convergence. K-means is suitable for datasets with a large number
of data points and non-overlapping clusters but may not work well for irregular or non-convex datasets.
Hierarchical clustering is a technique that can be performed in two different ways, namely, top-down and bottom-up
clustering. Agglomerative algorithms are examples of bottom-up clustering algorithms. These algorithms start by con-
sidering each data point as an individual cluster and then combine smaller clusters progressively into larger ones. This
results in a hierarchical structure of nested clusters where each cluster consists of subclusters with different levels of
granularity. In contrast, divisive algorithms employ a top-down approach where the entire dataset is considered as
one cluster initially. These algorithms then recursively partition the dataset into smaller and more homogeneous clusters
until each data point is assigned to a separate cluster. Divisive clustering results in a binary tree structure where each node
represents a partition of the data and each leaf node represents a single data point. Both agglomerative and divisive clus-
tering have their advantages and disadvantages, and the choice of algorithm depends on the specific characteristics of the
data and the objectives of the analysis. Agglomerative clustering is more efficient for large datasets with a high number of
data points, whereas divisive clustering is better suited for datasets with a small number of data points or when the num-
ber of clusters is known a priori. In summary, hierarchical clustering can be performed using two different approaches,
top-down and bottom-up clustering. Agglomerative algorithms are examples of bottom-up clustering, whereas divisive
algorithms use a top-down approach. The choice of clustering algorithm depends on the specific requirements of the data
analysis task. In contrast to k-means clustering, hierarchical clustering does not require a predetermined number of clus-
ters as the number of clusters is not known beforehand and is determined based on the similarity or dissimilarity between
data points.
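As a short sketch (the synthetic blobs and the choice of Ward linkage are illustrative), bottom-up agglomerative clustering can be run with scikit-learn as follows:

from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering

# Synthetic dataset with three groups
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=42)

# Bottom-up (agglomerative) clustering with Ward linkage
agg = AgglomerativeClustering(n_clusters=3, linkage='ward')
labels = agg.fit_predict(X)
print(labels[:20])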
There are several unsupervised machine learning algorithms available for clustering and dimensionality reduction,
including k-means, mini-batch k-means, Ward, and mean shift. These algorithms are typically implemented in a stan-
dardized way, involving data rescaling, instantiation of the estimator, model fitting, cluster assignment (if required),
and algorithm assessment. K-means clustering is a popular algorithm for unsupervised learning that has been extensively
studied and used in various applications. Mini-batch k-means is a variant of k-means that is faster and more scalable,
making it suitable for large datasets. Ward is a hierarchical clustering algorithm that can be used with or without con-
nectivity constraints. Mean shift is another clustering algorithm that iteratively moves a kernel to the local mode of the
distribution, resulting in clusters that are of varying sizes and shapes. Affinity propagation is another unsupervised learn-
ing algorithm that creates clusters by sending messages between pairs of samples until convergence. The algorithm can be
computed based on either the Spearman distance or the Euclidean distance, with the similarity measure computed as the
opposite of the distance or equality. The preference value for all points can be computed as the median, minimum, or
mean value of the similarity values. Affinity propagation is implemented by rescaling the data, computing the similarity
and preference values, performing affinity propagation clustering, and assessing the algorithm’s performance. Another
example is a density-based spatial clustering algorithm (DBSCAN) designed for applications with noise. It is an unsuper-
vised clustering method that identifies core samples of high density and expands clusters from them. The algorithm parti-
tions data into clusters by grouping together neighboring data points that satisfy a density criterion, while data points that
do not belong to any cluster are considered noise. Unlike other clustering algorithms, DBSCAN can identify clusters of
arbitrary shapes that do not need to be convex-shaped. DBSCAN is a deterministic algorithm that produces the same clus-
ters when given the same data in the same order. However, changes to the order of the data may result in different cluster
formations.
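The standardized workflow mentioned above (rescaling, instantiation, fitting, cluster assignment, and assessment) can be sketched in a few lines; this is a hedged illustration using scikit-learn names only, and the dataset and choice of k are arbitrary assumptions:

from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)   # toy data

X_scaled = StandardScaler().fit_transform(X)   # 1) rescale the data
model = KMeans(n_clusters=4, random_state=0)   # 2) instantiate the estimator
model.fit(X_scaled)                            # 3) fit the model
labels = model.predict(X_scaled)               # 4) assign clusters (if required)
print(silhouette_score(X_scaled, labels))      # 5) assess the algorithm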
3.6.1.1 K-means
K-means clustering aims to partition a set of observations (x1, x2, ..., xn), where each observation is a d-dimensional real
vector, into k (≤ n) clusters S = {S1, S2, ..., Sk} to minimize the within-cluster sum of squares (WCSS) or variance. The objec-
tive is to find the values of Si that minimize the equation:
$$\underset{S}{\arg\min} \; \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2 \;=\; \underset{S}{\arg\min} \; \sum_{i=1}^{k} \lvert S_i \rvert \,\operatorname{Var}(S_i)$$

where $\mu_i = \frac{1}{\lvert S_i \rvert} \sum_{x \in S_i} x$ is the centroid or mean of the data points in $S_i$, $\lvert S_i \rvert$ is the size of $S_i$, and $\lVert \cdot \rVert$ is the L2 norm. This is equivalent to minimizing the pairwise squared deviations of points within the same cluster:

$$\underset{S}{\arg\min} \; \sum_{i=1}^{k} \frac{1}{\lvert S_i \rvert} \sum_{x,\, y \in S_i} \lVert x - y \rVert^2$$
The total variance is constant, so maximizing the between-cluster sum of squares (BCSS) is equivalent to minimizing the WCSS: the total sum of squares decomposes exactly into WCSS plus BCSS, which is the law of total variance in probability theory.
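This decomposition can be checked numerically on any partition of a dataset; the following is a small hedged sketch, where the array X and the two-cluster assignment are arbitrary assumptions chosen only for illustration:

import numpy as np

X = np.random.rand(100, 2)                      # toy data
labels = (X[:, 0] > 0.5).astype(int)            # an arbitrary two-cluster assignment
overall_mean = X.mean(axis=0)

total_ss = ((X - overall_mean) ** 2).sum()
wcss = sum(((X[labels == j] - X[labels == j].mean(axis=0)) ** 2).sum() for j in (0, 1))
bcss = sum(len(X[labels == j]) * ((X[labels == j].mean(axis=0) - overall_mean) ** 2).sum()
           for j in (0, 1))

# The total sum of squares decomposes into within- plus between-cluster sums of squares
assert np.isclose(total_ss, wcss + bcss)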
To perform k-means clustering, you need to follow these steps:
1) First, you need to determine the number of clusters (k) you want to create and prepare a training set of examples.
2) Next, you need to randomly select k cluster centroids.
3) Assign each example in the training set to the closest centroid based on a specific distance metric for each fixed set of
centroids.
4) Update the centroids based on the mean of the assigned data points.
5) Keep repeating steps 3 and 4 until convergence is reached, which is typically measured by a threshold for minimum
change in either cluster assignment or centroid location.
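As an illustration of these five steps, here is a minimal NumPy sketch of Lloyd's algorithm; the function name and defaults are assumptions for illustration, and this is not the scikit-learn implementation used in the example below:

import numpy as np

def simple_kmeans(X, k, n_iter=100, seed=0):
    # Steps 1-2: choose k initial centroids at random from the training examples
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 3: assign every example to the closest centroid (Euclidean distance)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assignment = distances.argmin(axis=1)
        # Step 4: update each centroid as the mean of its assigned points
        new_centroids = np.array([X[assignment == j].mean(axis=0) if np.any(assignment == j)
                                  else centroids[j] for j in range(k)])
        # Step 5: stop when the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, assignment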
To demonstrate k-means clustering in Python, we can use the scikit-learn library. We first need to import the necessary modules: the make_blobs() function from sklearn.datasets to create a synthetic dataset and the KMeans estimator from sklearn.cluster to perform clustering on it. The make_blobs function generates a dataset of 2000 samples, with 5 centers and a standard deviation of 1.5 for each cluster. The random_state parameter is set to 42 to ensure that the same dataset is generated each time the code is run. The next line sets the title of the scatter plot to “Data points.” The final line creates the
scatter plot of the dataset using the scatter function from Matplotlib. The first argument of the scatter function (data[0][:,0])
is the x-coordinates of the data points, and the second argument (data[0][:,1]) is the y-coordinates. The edgecolors parameter
is set to “black” to specify the color of the edge of each point, and the linewidths parameter is set to 0.5 to specify the thick-
ness of the edge. The show function is called to display the plot.
Input:
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Generate a synthetic dataset of 2000 samples, with 5 centers and a cluster standard deviation of 1.5
data = make_blobs(n_samples=2000, centers=5, cluster_std=1.5, random_state=42)

plt.title("Data points")
plt.scatter(data[0][:, 0], data[0][:, 1], edgecolors='black', linewidths=0.5)
plt.show()
Output:
(Figure: scatter plot of the 2000 generated data points, titled “Data points.”)
The next Python code uses the scikit-learn library to perform k-means clustering on a synthetic dataset generated by the
make_blobs function and generates a scatter plot of the data points color-coded by cluster. The first line creates a K_Means
object with five clusters. The n_clusters parameter specifies the number of clusters to form. The second line fits the k-means
algorithm to the data using the fit method of the K_Means object. This step assigns each data point to a cluster based on their
proximity to the centroid of the cluster. The third line predicts the cluster labels for each data point using the predict method
of the K_Means object. The fourth line sets the title of the scatter plot to “Data points in clusters.” The final line creates the
scatter plot of the dataset using the scatter function from Matplotlib.
Input:
# Create the estimator with five clusters
K_Means = KMeans(n_clusters=5)
# Training
K_Means.fit(data[0])
# Make predictions
clusters = K_Means.predict(data[0])

plt.title("Data points in clusters")
plt.scatter(data[0][:, 0], data[0][:, 1], c=clusters, edgecolors='black', linewidths=0.5)
plt.show()
Output:
(Figure: scatter plot titled “Data points in clusters,” with points color-coded by their k-means cluster.)
Overall, this code demonstrates how to perform k-means clustering on a synthetic dataset using scikit-learn and visualize
the clusters using a scatter plot.
Input:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import MiniBatchKMeans
data = pd.read_csv("../data/datasets/housing.csv")
data = data.loc[:, ["median_income", "latitude", "longitude"]]
data.head()
Output:
(Table: first rows of the DataFrame, with columns median_income, latitude, and longitude.)
Then, we define a MiniBatchKMeans object called MiniBatch with several parameters. The n_clusters parameter specifies
the number of clusters to be created (K = 6). The batch_size parameter sets the size of each random batch of data to be used
during each iteration of the algorithm (batch_size = 6). Next, the code creates a new column called “Cluster” in the data
DataFrame by using the fit_predict method of the MiniBatch object. This method fits the model to the data and assigns each
data point to a cluster. The resulting cluster labels are stored in the “Cluster” column. Then, the code converts the “Cluster”
column to an integer data type using the astype method. The next few lines of code set some plotting parameters using the
plt.style.use, plt.rc, and sns.relplot methods. The resulting plot provides a visual representation of the clusters and can be
used to gain insights into the data.
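The instantiation and cluster-assignment steps described above are not shown in this excerpt; a sketch consistent with that description might look like the following (the object name MiniBatch matches the text, while the remaining details are assumptions):

# Instantiate mini-batch k-means with K = 6 clusters and random batches of 6 samples
MiniBatch = MiniBatchKMeans(n_clusters=6, batch_size=6)

# Fit the model and store each row's cluster label in a new "Cluster" column
data["Cluster"] = MiniBatch.fit_predict(data)
data["Cluster"] = data["Cluster"].astype(int)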
Input:
plt.style.use('seaborn-colorblind')
plt.rc("figure", autolayout=True)
plt.rc("axes", labelsize='large', titlesize=10, titlepad=10)
sns.relplot(x='longitude', y='latitude', hue='Cluster', data=data, height=6)
plt.show()
Output:
(Figure: Seaborn scatter plot of latitude versus longitude, with points colored by the six cluster labels 0–5.)
• For every data point, compute the mean of all points encompassed within a specified radius, or “kernel,” centered at
the data point.
A key advantage of mean-shift clustering is the elimination of a priori specification of cluster numbers. For instance, the
k-means algorithm entails specifying the number of clusters (k) and subsequently identifying the most suitable cluster for
each data instance. However, determining an appropriate initial value for k can be challenging, as k may range from 1 to the
total number of data instances. The pursuit of identifying the optimal number of clusters remains an active area of research,
with various techniques available, though their success varies with increasing data dimensionality. Additionally, mean shift
refrains from making assumptions regarding data distribution and accommodates clusters of varying shapes and sizes.
However, the algorithm’s performance is susceptible to kernel choice and kernel radius. In contrast to the widely employed
k-means clustering algorithm, the mean-shift technique does not necessitate a predetermined cluster count. Instead, the
algorithm inherently ascertains the optimal number of clusters based on the inherent structure of the data. The mean-shift
technique is grounded in the principles of kernel density estimation (KDE). Envision the dataset as originating from a
probability distribution. KDE serves as a method for approximating the underlying distribution, also known as the prob-
ability density function, associated with a dataset. This is achieved by positioning a kernel, or a weight function commonly
utilized in convolution, on each data point. Numerous kernel types exist, with the Gaussian kernel being the most prevalent:
$$K(x) = \frac{1}{(2\pi)^{d/2}} \, e^{-\frac{1}{2}\lVert x \rVert^2}$$
where x is a d-dimensional real vector. Placing one kernel on each observation x1, x2, ..., xn and summing the
individual kernels yields a probability surface, exemplified by a density function. The resulting density function is contin-
gent upon the chosen kernel bandwidth parameter.
For a clearer understanding, let us examine synthetic data.
Input:
plt.show()
Output:
(Figure: scatter plot of the synthetic data points, titled “Data points.”)
The point density within a cluster reaches its maximum in the vicinity of the cluster’s centroid. By generalizing this asser-
tion, one can infer that the probable center of any cluster can be determined by examining the point density at specific
locations in the given diagram. Consequently, the number of clusters can be ascertained, along with the approximate centers
of the identified clusters. The mean shift clustering algorithm operates by identifying the “mode” of the density and asses-
sing its highest points. It iteratively shifts data points toward the nearest mode, ultimately yielding a set of clusters and
facilitating sample-to-cluster assignments upon completion of the fitting process.
Consider a set of observations (x1, x2, ..., xn), where each observation is a d-dimensional real vector. Additionally, assume
the selection of a kernel K with a bandwidth parameter h. The bandwidth parameter plays a crucial role as it delineates a
region surrounding the samples, within which the mean shift algorithm should investigate to ascertain the most plausible
trajectory based on density estimation. However, the determination of an appropriate bandwidth value remains a pertinent
question. Employing this set of observations and kernel function, the subsequent kernel density estimator for the entire
population’s density function can be obtained as follows:
$$f_K(x) = \frac{1}{n h^d} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right)$$
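Before looking at the library call, here is a small hedged sketch of this estimator with the Gaussian kernel, together with one mean-shift step that moves a point toward the local mode; all names and parameter values are illustrative assumptions rather than the library's implementation:

import numpy as np

def gaussian_kernel(u):
    # K(u) = (2*pi)^(-d/2) * exp(-||u||^2 / 2)
    d = u.shape[-1]
    return np.exp(-0.5 * np.sum(u ** 2, axis=-1)) / (2 * np.pi) ** (d / 2)

def kde(x, samples, h):
    # f_K(x) = 1/(n h^d) * sum_i K((x - x_i) / h)
    n, d = samples.shape
    return gaussian_kernel((x - samples) / h).sum() / (n * h ** d)

def mean_shift_step(x, samples, h):
    # Shift x to the kernel-weighted mean of the samples around it
    w = gaussian_kernel((x - samples) / h)
    return (w[:, None] * samples).sum(axis=0) / w.sum()

samples = np.random.randn(200, 2)            # toy data drawn from an unknown density
x = np.array([1.0, 1.0])
print(kde(x, samples, h=0.5))                # density estimate at x
print(mean_shift_step(x, samples, h=0.5))    # x moved toward the nearest mode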
Input:
from sklearn.cluster import MeanShift, estimate_bandwidth

# Estimate a suitable kernel bandwidth directly from the data
bandwidth = estimate_bandwidth(X)

# Instantiate and fit the Mean Shift model using the estimated bandwidth
meanshift = MeanShift(bandwidth=bandwidth)
meanshift.fit(X)

# Retrieve the labels for each data point and find the unique labels
labels = meanshift.labels_
labels_unique = np.unique(labels)

# Define colors for each cluster label and generate a list of colors for the plot
colors = list(map(lambda x: 'red' if x == 1 else 'blue' if x == 2 else 'green' if x == 3 else 'orange', labels))

# Create a scatter plot of the data points with colors based on their cluster assignments
plt.scatter(X[:, 0], X[:, 1], c=colors, marker="o", picker=True)
plt.show()
Output:
(Figure: scatter plot of the data points, colored by their mean-shift cluster assignments.)
As we can see in the code above, the estimate_bandwidth function plays a crucial role, as it calculates the optimal band-
width tailored to the specific dataset under consideration. Subsequently, the estimated bandwidth is utilized during the
instantiation of the mean shift algorithm. Following this, the data are fit to the model, and pertinent information such
as the number of labels is derived.
Consider a dataset D comprised of data points x1, x2, ..., xn. Let s represent an n × n matrix, wherein s(i, j) signifies the similarity between data points xi and xj. The negative squared distance between two data points is utilized as the similarity; that is, s(i, j) = −‖xi − xj‖².
The diagonal of the matrix s, specifically s(i, i), holds particular importance as it denotes the input preference, which
reflects the probability of a given input serving as an exemplar. When initialized to a uniform value for all inputs, it dictates
the number of classes generated by the algorithm. A value approximating the minimum possible similarity results in fewer
classes, while a value that is near or surpasses the maximum possible similarity leads to an increased number of classes.
Typically, the median similarity of all input pairs is employed as the initial value.
The algorithm progresses through alternating between two message-passing stages, leading to the modification of two
matrices:
• The “responsibility” matrix R encompasses values r(i, k), which quantify the degree to which xk is apt to act as the exem-
plar for xi in comparison to alternative candidate exemplars for xi. The responsibility matrix is initialized to contain only
zeros and updates are disseminated throughout the system:
$$r(i,k) \leftarrow s(i,k) - \max_{k' \neq k} \bigl\{ a(i,k') + s(i,k') \bigr\}$$
• The “availability” matrix A comprises values a(i, k), which convey the extent of “appropriateness” of xi selecting xk as its exemplar, considering the preferences of other data points for xk as an exemplar. The availability matrix is also initialized to contain only zeros and updated as follows:

$$a(i,k) \leftarrow \min\Bigl(0,\; r(k,k) + \sum_{i' \notin \{i,k\}} \max\bigl(0, r(i',k)\bigr)\Bigr) \quad (i \neq k), \qquad a(k,k) \leftarrow \sum_{i' \neq k} \max\bigl(0, r(i',k)\bigr)$$
The iterative process continues until either the cluster boundaries exhibit consistency across multiple iterations or a pre-
determined iteration count is achieved. Exemplars are derived from the ultimate matrices, identified by a positive combined
value of “responsibility and availability” for themselves (specifically, (r(i, i) + a(i, i)) > 0).
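A compact NumPy sketch of this procedure is shown below. It builds the similarity matrix from negative squared distances with the median similarity as the preference on the diagonal, applies the two message-passing updates with damping (a common practical addition not discussed above), and reads the exemplars off the final matrices; the function name, damping value, and iteration count are illustrative assumptions:

import numpy as np

def affinity_propagation_sketch(X, max_iter=200, damping=0.5):
    n = len(X)
    # Similarity: negative squared Euclidean distance; preference: median off-diagonal similarity
    S = -np.square(X[:, None, :] - X[None, :, :]).sum(axis=2)
    np.fill_diagonal(S, np.median(S[~np.eye(n, dtype=bool)]))
    R = np.zeros((n, n))   # responsibilities r(i, k)
    A = np.zeros((n, n))   # availabilities a(i, k)
    for _ in range(max_iter):
        # r(i,k) <- s(i,k) - max_{k' != k} [a(i,k') + s(i,k')]
        AS = A + S
        idx = AS.argmax(axis=1)
        first = AS[np.arange(n), idx].copy()
        AS[np.arange(n), idx] = -np.inf
        second = AS.max(axis=1)
        R_new = S - first[:, None]
        R_new[np.arange(n), idx] = S[np.arange(n), idx] - second
        R = damping * R + (1 - damping) * R_new
        # a(i,k) <- min(0, r(k,k) + sum_{i' not in {i,k}} max(0, r(i',k))), with a(k,k) as in the text
        Rp = np.maximum(R, 0)
        np.fill_diagonal(Rp, R.diagonal())
        A_new = Rp.sum(axis=0)[None, :] - Rp
        diag = A_new.diagonal().copy()
        A_new = np.minimum(A_new, 0)
        np.fill_diagonal(A_new, diag)
        A = damping * A + (1 - damping) * A_new
    # Exemplars are the points with (r(i,i) + a(i,i)) > 0; each point joins its most similar exemplar
    exemplars = np.flatnonzero(np.diag(R + A) > 0)
    labels = exemplars[np.argmax(S[:, exemplars], axis=1)]
    return exemplars, labels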
Let us show an example in Python following the same procedure as above. After generating synthetic data, the code sets up the affinity propagation algorithm by creating an instance with a preference value of −50 and fits the model to the input data X.
Input:
from sklearn.cluster import AffinityPropagation

# Set up affinity propagation with a preference of -50 and fit it to the data X
af = AffinityPropagation(preference=-50).fit(X)

# Retrieve the indices of the exemplars, the cluster labels, and the number of clusters
cluster_centers_indices = af.cluster_centers_indices_
labels = af.labels_
n_clusters_ = len(cluster_centers_indices)

# Plot the raw data points
plt.title("Data points")
plt.scatter(X[:, 0], X[:, 1], edgecolors='black', linewidths=0.5)
plt.show()
Output:
(Figure: scatter plot of the synthetic data points, titled “Data points.”)
Input:
# Plot exemplars
# Close any existing plots, create a new figure, and clear the figure
plt.close('all')
plt.figure(1)
plt.clf()

from itertools import cycle

# Define a cyclic color sequence to be used in plotting the clusters
colors = cycle('bgrcmykbgrcmykbgrcmykbgrcmyk')

# Iterate through the range of the number of clusters and the color sequence
for k, col in zip(range(n_clusters_), colors):
    class_members = labels == k
    cluster_center = X[cluster_centers_indices[k]]
    plt.plot(X[class_members, 0], X[class_members, 1], col + '.')
    plt.plot(cluster_center[0], cluster_center[1], 'o', markerfacecolor=col,
             markeredgecolor='k', markersize=14)
    for x in X[class_members]:
        plt.plot([cluster_center[0], x[0]], [cluster_center[1], x[1]], col)

# Display the plot showing the clusters, their centers, and the connections between
# the data points and their respective cluster centers
plt.show()
Output:
(Figure: clusters found by affinity propagation, with exemplars highlighted and lines connecting each point to its exemplar.)
Input:
from sklearn.datasets import make_circles
from sklearn.cluster import KMeans, DBSCAN

# Generate a synthetic dataset of circles with 1000 samples, a scale factor of 0.3, and a noise level of 0.1
X, labels_true = make_circles(n_samples=1000, factor=0.3, noise=0.1)

# Set the number of clusters for K-means and perform the clustering
clusters_kmeans = 2
kmeans = KMeans(n_clusters=clusters_kmeans)
y_k = kmeans.fit_predict(X)

# Perform DBSCAN clustering with an epsilon value of 0.3 and minimum samples of 10
db = DBSCAN(eps=0.3, min_samples=10).fit(X)
Output:
(Figure: two side-by-side scatter plots of the circles dataset. The left panel, “K-means with chosen number of clusters: 2,” shows the k-means clusters and centroids; the right panel, “DBSCAN with estimated number of clusters: 2,” shows the DBSCAN clusters.)
3.7 Machine Learning Algorithms with HephAIstos

As stated in Chapter 1, and thanks to great open-source frameworks such as scikit-learn and Keras, we can create pipelines with hephAIstos for classification purposes using both CPUs and GPUs.
If we use CPUs, the classification_algorithms parameter can be chosen with the following options:
– gnb
– mnb
– kneighbors
◦ For k-neighbors, we need to add an additional parameter to specify the number of neighbors (n_neighbors).
– sgd
– nearest_centroid
– decision_tree
– random_forest
◦ For random_forest, we can optionally add the number of estimators (n_estimators_forest).
– extra_trees
◦ For extra_trees, we add the number of estimators (n_estimators_forest).
– mlp_neural_network
◦ The following parameters are available: max_iter, hidden_layer_sizes, activation, solver, alpha, learning_rate,
learning_rate_init.
◦ max_iter: The maximum number of iterations (default = 200).
◦ hidden_layer_sizes: The ith element represents the number of neurons in the ith hidden layer.
◦ mlp_activation: The activation function for the hidden layer (“identity,” “logistic,” “relu,” “softmax,” “tanh”; default
= “relu”).
◦ solver: The solver for weight optimization (“lbfgs,” “sgd,” “adam”; default = “adam”).
◦ alpha: Strength of the l2 regularization term (default = 0.0001).
◦ mlp_learning_rate: Learning rate schedule for weight updates (“constant,” “invscaling,” “adaptive”; default =
“constant”).
◦ learning_rate_init: The initial learning rate used (for sgd or Adam). It controls the step size in updating the weights.
– mlp_neural_network_auto: This option allows us to find the optimal parameters for the neural network
◦ For each classification algorithm, we also need to add the number of k-folds for cross-validation (cv).
Here is an example:
# Run ML Pipeline
ml_pipeline_function(df, output_folder = './Outputs/', missing_method =
'row_removal', test_size = 0.2, categorical = ['label_encoding'],features_label =
['Target'], rescaling = 'standard_scaler', classification_algorithms=
['svm_rbf','lda', 'random_forest', 'gpu_logistic_regression'], n_estimators_forest =
100, gpu_logistic_activation = 'adam', gpu_logistic_optimizer = 'adam',
gpu_logistic_epochs = 50, cv = 5)
The above code will print the steps of the processes and provide the metrics of our models such as the following:
• gpu_rnn: Recurrent neural network for classification. We need to set the following parameters:
– rnn_units: A positive integer, the dimensionality of the output space
– rnn_activation: The activation function to use (softmax, sigmoid, linear, or tanh)
– rnn_optimizer: The optimizer (adam, sgd, RMSprop)
– rnn_loss: The loss function such as the mean squared error (“mse”), the binary logarithmic loss (“binary_crossen-
tropy”), or the multi-class logarithmic loss (“categorical_crossentropy”)
– rnn_epochs: The number of epochs (integer)
Let us view an example:
# Run ML Pipeline
ml_pipeline_function(df, output_folder = './Outputs/', missing_method =
'row_removal', test_size = 0.2, categorical = ['label_encoding'],features_label =
['Target'], rescaling = 'standard_scaler', classification_algorithms=
['svm_rbf','lda', 'random_forest', 'gpu_logistic_regression'], n_estimators_forest =
100, gpu_logistic_activation = 'adam', gpu_logistic_optimizer = 'adam',
gpu_logistic_epochs = 50, cv = 5)
The above code will print the steps of the processes and provide the metrics of our models such as the following:
# Run ML Pipeline
ml_pipeline_function(df, output_folder = './Outputs/', missing_method =
'row_removal', test_size = 0.2, categorical = ['label_encoding'],features_label =
['Target'], rescaling = 'standard_scaler', classification_algorithms=
['svm_rbf','lda', 'random_forest', 'gpu_logistic_regression'], n_estimators_forest =
100, gpu_logistic_optimizer = SGD(learning_rate = 0.001), gpu_logistic_epochs = 50,
cv = 5)
In addition, and with the same philosophy, regression algorithms are also available for both CPUs and GPUs:
– gpu_rnn_regression: Recurrent neural network for regression. We need to set the following parameters:
◦ rnn_units: A positive integer, the dimensionality of the output space
◦ rnn_activation: The activation function to use (softmax, sigmoid, linear, or tanh)
◦ rnn_optimizer: The optimizer (adam, sgd, RMSprop)
◦ rnn_loss: The loss function such as the mean squared error (“mse”), the binary logarithmic loss (“binary_crossen-
tropy”), or the multi-class logarithmic loss (“categorical_crossentropy”)
◦ rnn_epochs: The number of epochs (integer)
Let us view another example:
# Run ML Pipeline
ml_pipeline_function(df, output_folder = './Outputs/', missing_method =
'row_removal', test_size = 0.2, categorical = ['label_encoding'],features_label =
['Target'], rescaling = 'standard_scaler', regression_algorithms=
['linear_regression','svr_linear', 'svr_rbf', 'gpu_linear_regression'],
gpu_linear_epochs = 50, gpu_linear_activation = 'linear', gpu_linear_learning_rate =
0.01, gpu_linear_loss = 'mse')
The above code will print the steps of the processes and provide the metrics of our models such as the following:
# Load data
DailyDelhiClimateTrain = './data/datasets/DailyDelhiClimateTrain.csv'
df = pd.read_csv(DailyDelhiClimateTrain, delimiter=',')
# Run ML Pipeline
ml_pipeline_function(df, output_folder = './Outputs/', missing_method =
'row_removal', test_size = 0.2, rescaling = 'standard_scaler', regression_algorithms=
['gpu_rnn_regression'], rnn_units = 500, rnn_activation = 'tanh' , rnn_optimizer =
'RMSprop', rnn_loss = 'mse', rnn_epochs = 50)
To use CNNs, we can include the conv2d option, which will apply a two-dimensional CNN using GPUs if they are avail-
able. The parameters are the following:
• conv_kernel_size: The kernel_size is the size of the filter matrix for the convolution (conv_kernel_size × conv_kernel_size)
• conv_activation: The activation function to use (softmax, sigmoid, linear, relu, or tanh)
• conv_optimizer: The optimizer (adam, sgd, RMSprop)
• conv_loss: The loss function such as the mean squared error (“mse”), the binary logarithmic loss (“binary_crossentropy”), or the multi-class logarithmic loss (“categorical_crossentropy”)
import tensorflow as tf
from keras.datasets import mnist
from tensorflow.keras.utils import to_categorical
df = mnist.load_data()
(X, y), (_,_) = mnist.load_data()
(X_train, y_train), (X_test, y_test) = df
# Here we reshape the data to fit the model, with X_train.shape[0] images for training;
# the image size is X_train.shape[1] x X_train.shape[2].
# The trailing 1 means that each image is greyscale (a single channel).
X_train = X_train.reshape(X_train.shape[0],X_train.shape[1],X_train.shape[2],1)
X_test = X_test.reshape(X_test.shape[0],X_test.shape[1],X_test.shape[2],1)
X = X.reshape(X.shape[0],X.shape[1],X.shape[2],1)
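The to_categorical helper imported above is not used in this excerpt; a typical next step, shown here as a hedged sketch rather than the book's code, would be to one-hot encode the digit labels before training the network:

# One-hot encode the 10 digit classes so they match a softmax output layer
y_train = to_categorical(y_train, num_classes=10)
y_test = to_categorical(y_test, num_classes=10)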
4 Natural Language Processing
Natural language processing (NLP) refers to the branch of artificial intelligence (AI) focused on giving computers the ability
to understand text and spoken words. It combines computational linguistics with statistical, machine learning, and deep
learning models. The main idea is to process human language in the form of text or voice data. NLP has many applications
such as understanding the intent and sentiment of a speaker or a writer, translating text from one language to another,
responding to spoken commands, or summarizing large volumes of text. NLP can be found in voice-operated global posi-
tioning systems (GPS), digital assistants, speech-to-text services, chatbots, named entity recognition, sentiment analysis, and
text generation. Understanding human language is not an easy task, as a dialog contains many ambiguities such as sar-
casms, metaphors, variations, and homonyms. When we say “understanding,” we need to be clear about the lack of con-
sciousness of a machine. Alan Turing decided to dismiss these types of inquiries and ask a simple question: Can a computer
talk like a human? To answer this question, he imagined the famous Turing Test in a paper published in 1950. A program
called ELIZA subsequently succeeded in misleading people by mimicking a psychotherapist.
Python provides a wide range of tools and libraries for processing natural language. Popular ones are the Natural Lan-
guage Toolkit (NLTK), which is an open-source collection of libraries for building NLP programs, and SpaCy, which is a free
open-source library for advanced NLP. We can find routines for sentence parsing, word segmentation, tokenization, and
other purposes.
A few concepts need to be understood before exploring coding of NLP. We have mentioned tokenization, which refers to splitting a document into tokens, the sequences of characters that are used as units for analysis. In general, we need effective tokens to allow our program to process data: tokens should be stored in an iterable data structure such as a list or generator to facilitate the analysis and should be free of non-alphanumeric characters. For example, the following list represents tokens:
['My', 'name', 'is', 'Xavier', 'Vasques', 'and', 'I', 'speak', 'about', 'NLP']
When we have our tokens, it is easier to perform some basic analytics such as a simple word count. We can also imagine
the need to create a pandas DataFrame containing some features such as the number of documents in which a token
appears, a count of appearances of a token, or the rank of a token relative to other tokens. If we are in a context in which
we desire to analyze emails and predict which ones are spam and which are not, we can use historical data and tag the emails
that are spam in order to run classification algorithms such as KNN. We will need to convert our text into numerical data
(vector format). The bag-of-words method is an approach in which each unique word in a text is represented by a number. In
general, there is a need to clean our dataset by removing elements such as punctuation or common words (“a,” “I,” “the,”
etc.). NLTK includes punkt, providing a pre-trained tokenizer for English, averaged_perceptron_tagger, providing a pre-
trained parts-of-speech tagger for English, and stopwords, which is a list of 179 English stop words such as “I,” “a,” and
“the.” Another concept often seen in NLP is “stemming,” which consists of reducing words to the root form. If we consider
the word “action,” we can find in a text the following words: “action,” “actionable,” “actioned,” “actioning,” and so on. This
needs to be carefully studied, as we can be in a situation of over-stemming or under-stemming. In the case of over-stemming, words with different meanings are reduced to the same stem, such as “univers” for “universe,” “universal,” or “university.” In the case of under-stemming, words are not reduced to the same stem even though they share the same root (e.g., “alumnus,” “alumni,” “alumnae”). When
we need to tag each word (noun, verb, adjective, etc.), we consider parts-of-speech tagging and named entity recognition to
obtain textual mentions of named entities (person, place, organization, etc.).
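As a small illustration of stemming, here is a hedged sketch using NLTK's PorterStemmer (which is also used later in this chapter); the word lists are the examples from the text:

from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()
# Related word forms are typically mapped to a common stem ...
for word in ["action", "actionable", "actioned", "actioning"]:
    print(word, "->", ps.stem(word))
# ... but words sharing a root can be under-stemmed to different stems
for word in ["alumnus", "alumni", "alumnae"]:
    print(word, "->", ps.stem(word))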
There are a few approaches to work on NLP, such as symbolic, statistical, and neural NLP. The symbolic approach refers
to a hand-coded set of rules associated with dictionary lookup, which is different from the approaches of systems that can
learn rules automatically. NLP applications face some challenges, such as handling increasing volumes of text and voice
data or difficulty in scaling. The symbolic approach can be used in combination with others or when the training dataset
is not large enough. Statistical NLP refers to the use of statistical inferences to learn rules automatically by analyzing large
corpora. Many machine learning algorithms have been applied to handle natural language by examining a large set of fea-
tures. The challenge with this approach is often the requirement for elaborate feature engineering. Deep learning
approaches such as those based on convolutional neural networks and recurrent neural networks enable NLP systems
to process large volumes of raw, unstructured, and unlabeled text and voice datasets.
In NLP, there are two important terms to know: natural language understanding (NLU), which is the process of making
the NLP system understand a natural language input, and natural language generation (NLG), which is the process of pro-
ducing sentences in the form of natural language that makes sense. NLU is composed of several steps including lexical
analysis, syntactic analysis, semantic analysis, discourse analysis, and pragmatic analysis. Like NLU, NLG also contains
several procedural steps such as discourse generation, sentence planning, lexical choice, sentence structuring, and morpho-
logical generation.
4.1 Classifying Messages as Spam or Ham

Let us start a classic NLP use case, described in Kaggle and using the UCI datasets, which include a collection of more than
5000 SMS phone messages. The objective of this use case is to train a machine learning model on labeled data and use it to
classify unlabeled messages as spam or ham.
The dataset we will use is composed of 3375 SMS ham messages randomly chosen from 10,000 legitimate messages collected
at the Department of Computer Science at the National University of Singapore and 450 SMS ham messages collected from a
PhD thesis. The dataset also contains 425 SMS spam messages manually extracted from the Grumbletext Web site as well as
1002 SMS ham messages and 322 spam messages from the SMS Spam Corpus.
After installing NLTK (using pip install, for example), we import NLTK.
Input:
# Import libraries
import nltk
print(len(messages))
Output:
5574
print('\n')
Output:
0 ham Go until jurong point, crazy.. Available only in bugis n great world la e buffet...
Cine there got amore wat...
2 spam Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to
receive entry question(std txt rate)T&C's apply 08452810075over18's
4 ham Nah I don't think he goes to usf, he lives around here though
We will use pandas, Matplotlib, and Seaborn to manipulate and visualize data.
Input:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Use read_csv
messages = pd.read_csv('../data/datasets/SMSSpamCollection', sep='\t', names=
['label', 'message'])
messages.head()
Output:
(Table: first five rows of the DataFrame, with columns label and message.)
We can start a descriptive analysis of the data and visualize data through plots.
Input:
messages.describe()
messages.groupby('label').describe()
Output:
(Table: descriptive statistics of the messages, grouped by label.)
Let us add a new feature called length, telling us how long the text messages are.
Input:
# We add a new feature called length telling us how long the text messages are
messages['length'] = messages['message'].apply(len)
messages.head()
Output:
(Table: first rows of the DataFrame, now including the new length column.)
Now, we will plot the frequency versus the length of the messages.
Input:
messages.length.describe()
messages['length'].plot(bins=50, kind='hist')
Output:
count 5572.000000
mean 80.489950
std 59.942907
min 2.000000
25% 36.000000
50% 62.000000
75% 122.000000
max 910.000000
Name: length, dtype: float64
(Figure: histogram of message lengths, with frequency on the y-axis and message length on the x-axis.)
Let us be curious and read the message with 910 characters as well as the one with only two.
Input:
Output:
"For me the love should start with attraction.i should feel that I need her every time around me.
she should be the first thing which comes in my thoughts.I would start the day and end it with her.
she should be there every time I dream.love will be then when my every breath has her name.my life
should happen around her.my life will be named to her.I would cry for her.will give all my
happiness and take all her sorrows.I will be ready to fight with anyone for her.I will be in love
when I will be doing the craziest things for her.love will be when I don't have to proove anyone
that my girl is the most beautiful lady on the whole planet.I will always be singing praises
for her.love will be when I start up making chicken curry and end up makiing sambar.life will be
the most beautiful then.will get every morning and thank god for the day because she is with me.
I would like to say a lot..will tell later.."
'Ok'
Let us check whether length could be a useful feature to distinguish ham and spam messages.
Input:
Output:
(Figure: histograms of message length plotted separately for ham and spam messages.)
Input:
import string
from nltk.corpus import stopwords
def text_process(mess):
    # Remove punctuation, then remove English stop words, and return the clean list of word tokens
    nopunc = ''.join([char for char in mess if char not in string.punctuation])
    return [word for word in nopunc.split() if word.lower() not in stopwords.words('english')]
Output:
Input:
from sklearn.feature_extraction.text import CountVectorizer

# Learn the vocabulary of the corpus (bag-of-words transformer) using our text_process analyzer
bag_transformer = CountVectorizer(analyzer=text_process).fit(messages['message'])
# Transform bag-of-words
messages_bag = bag_transformer.transform(messages['message'])
# TF-IDF
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer().fit(messages_bag)
messages_tfidf = tfidf_transformer.transform(messages_bag)
print(messages_tfidf.shape)
Output:
11425
Matrix Shape: (5572, 11425)
Non-zero occurrences: 50548
(5572, 11425)
As we can see above, we have 11,425 columns (unique words) and 5572 rows (messages). This matrix is called the bag
of words.
It is now time to train a model and apply it to the test SMS dataset (20% of the entire dataset) to predict whether SMS
messages are ham or spam. We will use naive Bayes with scikit-learn.
Input:
# Split dataset
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = \
train_test_split(messages_tfidf, messages['label'], test_size=0.2)
# Model fit with multinomial naive Bayes
from sklearn.naive_bayes import MultinomialNB
spam_detect_model = MultinomialNB().fit(X_train, y_train)
Input:
# Split dataset
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = \
train_test_split(messages_tfidf, messages['label'], test_size=0.2)
# Model fit with a linear support vector machine
from sklearn import svm
spam_detect_model = svm.SVC(kernel='linear').fit(X_train, y_train)
Input:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import tweepy
from textblob import TextBlob
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from wordcloud import WordCloud
import json
from collections import Counter
We can now define the search keyword (artificial intelligence) and the number of tweets to obtain the last 5000. This may
take some time.
Input:
Now, we can start some routines for sentiment analysis. The most common method is to analyze whether the tweets are
generally positive, negative, or neutral.
Input:
Output:
Next, we will create a Dataframe of tweets to facilitate operations and clean the searched tweets.
Input:
my_list_of_dicts = []
for each_json_tweet in searched_tweets:
    my_list_of_dicts.append(each_json_tweet._json)

# Save the collected tweets to a JSON file so they can be reloaded as plain dictionaries
with open('tweet_json_data.txt', 'w', encoding='utf-8') as file:
    file.write(json.dumps(my_list_of_dicts, indent=4))

my_demo_list = []
with open('tweet_json_data.txt', encoding='utf-8') as json_file:
    all_data = json.load(json_file)
    for each_dictionary in all_data:
        tweet_id = each_dictionary['id']
        text = each_dictionary['text']
        favorite_count = each_dictionary['favorite_count']
        retweet_count = each_dictionary['retweet_count']
        created_at = each_dictionary['created_at']
        my_demo_list.append({'tweet_id': str(tweet_id),
                             'text': str(text),
                             'favorite_count': int(favorite_count),
                             'retweet_count': int(retweet_count),
                             'created_at': created_at,
                             })

# Build a DataFrame of tweets from the collected fields
tweet_dataset = pd.DataFrame(my_demo_list,
                             columns=['tweet_id', 'text', 'favorite_count', 'retweet_count', 'created_at'])
Let us now check the shape and the first 10 tweets of our Dataframe.
Input:
tweet_dataset.shape
Output:
(5000, 5)
Input:
tweet_dataset.head()
Output:
tweet_id text favorite_count retweet_count created_at
0 1599404573386412032 RT @RealDMitchell: My @ObsNewReview column tod... 0 1 Sun Dec 04 14:05:40 +0000 2022
1 1599404503803236352 @renoomokri Peter Obi’s Manifesto is the most... 0 0 Sun Dec 04 14:05:23 +0000 2022
2 1599404474904514560 My @ObsNewReview column today is about surveys... 3 1 Sun Dec 04 14:05:16 +0000 2022
3 1599404456277577733 RT @Nilofer_tweets: Harvard University is offe... 0 12 Sun Dec 04 14:05:12 +0000 2022
Input:

def remove_pattern(input_txt, pattern):
    # Find all occurrences of the pattern (e.g., @user handles) and strip them from the text
    r = re.findall(pattern, input_txt)
    for i in r:
        input_txt = re.sub(i, '', input_txt)
    return input_txt

tweet_dataset['text'] = np.vectorize(remove_pattern)(tweet_dataset['text'], "@[\w]*")
Input:
corpus = []
for i in range(0, 1000):
    # Keep only alphanumeric characters and lowercase the tweet
    tweet = re.sub('[^a-zA-Z0-9]', ' ', tweet_dataset['text'][i])
    tweet = tweet.lower()
    # Remove retweet and URL markers
    tweet = re.sub('rt', '', tweet)
    tweet = re.sub('http', '', tweet)
    tweet = re.sub('https', '', tweet)
    tweet = tweet.split()
    # Stem each word and drop English stop words
    ps = PorterStemmer()
    tweet = [ps.stem(word) for word in tweet if not word in set(stopwords.words('english'))]
    tweet = ' '.join(tweet)
    corpus.append(tweet)
# Build the word cloud from the cleaned corpus
wordcloud = WordCloud().generate(' '.join(corpus))

plt.figure(figsize=(10, 7))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis('off')
plt.show()
Output:

(Figure: word cloud of the most frequent terms in the tweet corpus.)
Output:
[('googl offer free onlin cours ton field comput scienc aifici intellig 10 free c', 134),
('aifici intellig end 14 year long debat co pshvomgkw2', 32), ('dboss imag creat use aifici
intellig imagin dboss role forest tribe save ani', 28), ('dboss imag creat use aifici intellig
dboss role save anim forest dbo', 13), ('aifici intellig python cookbook recip appli ai
algorithm deeplearn techniqu use ten', 13), ('4 trust current leader deliv us fair equal access
1 anti age technolog 2 aifici intelligen', 13), ('harvard univers offer free cours aifici
intellig python beginn friendli 100 free', 12), ('thread stand close person whose covid passpo
code red 10 minut aifici intellig', 12), ('see alreadi done aifici intellig believ eventu
end displac hundr', 12), ('mani new crypto launch happen everyday find instantli social
trend grow holder', 10)]
Interestingly, the 5000 tweets we have analyzed are more negative than positive about AI. In the WordCloud image, we
see words and phrases such as science, free, computer science, role, social media, machine learning, google, code, python,
debate, use, research, and others.
In recent years, significant advancements in NLP have been observed, commencing in 2018 with the introduction of two
large-scale deep learning models: Generative Pre-Training (GPT) and Bidirectional Encoder Representations from Trans-
formers (BERT), which encompass BERT-Base and BERT-Large. Contrary to earlier NLP models, BERT is an open-source,
highly bidirectional, unsupervised language representation that is pre-trained solely on plain text corpora. This period also
witnessed the emergence of other substantial deep learning language models such as GPT-2, RoBERTa, ESIM+GloVe, and
GPT-3, fueling extensive technological discourse.
Foundation models (FMs) are defined as expansive AI models trained on large-scale, unlabeled datasets, typically
through self-supervised learning. This process results in a model capable of adapting to various downstream tasks.
Alternative terms for foundation models include Large Models, Large Language Models (LMs/LLMs), State-of-the-
Art (SOTA) models, or “Generative AI.” The development of such models is partly due to the challenge of limited
training data in NLP. While an abundance of textual data exists globally, creating task-specific datasets requires seg-
menting this vast collection into numerous distinct domains. This process yields only a few thousand or hundred thou-
sand human-annotated training examples. Unfortunately, deep learning NLP models require significantly larger data
quantities for optimal performance, with improvements observed when trained on millions or billions of labeled
examples.
Researchers have developed strategies for training universal language representation models using the extensive amounts
of unlabeled text available online, a process referred to as pre-training. These versatile pre-trained models can then be fine-
tuned using smaller, task-specific datasets for applications such as question-answering and sentiment analysis. This
approach yields considerable accuracy improvements compared to training exclusively on smaller, specialized datasets.
BERT, a recent development in NLP pre-training methodologies, has attracted attention in the deep learning community
due to its exceptional performance across various NLP tasks, including question-answering. These characteristics contrib-
ute to the increasing importance and applicability of foundation models in AI. One approach to utilizing foundation models
targets enterprises or research institutions that require precise, domain-specific models for purposes like automating work-
flows or accelerating scientific advancements. However, training a foundation model on the internet does not inherently
make it a domain expert, even if it appears credible to non-specialists in a particular field. While the AI revolution brings
considerable enthusiasm, managing foundation models is highly complex. The extensive process of converting data into a
functional model ready for deployment may necessitate weeks of manual labor and substantial computational resources.
Significant investments across all aspects, not just the models themselves, are crucial for fully harnessing the potential
of foundation models. Foundation models are expected to accelerate AI integration within businesses substantially.
By alleviating labeling requirements, organizations will find AI adoption considerably more straightforward. Ensuring
the capabilities of foundation models are accessible to all businesses in a hybrid cloud environment is of paramount impor-
tance. Multiple use cases include training a foundation model on various technical documents (e.g., equipment manuals,
product catalogs, how-to guides), using the model to label diverse images via a truly domain-expert AI, or generating
lines of code.
ChatGPT, an AI application developed by OpenAI, has garnered significant global interest. Designed as a chatbot-style
interface, ChatGPT allows users to pose open-ended questions or prompts for the model to answer, showcasing the poten-
tial of Large Language models. While ChatGPT primarily serves as a tool for tasks such as simplifying copywriting or
composing vows, it represents a critical milestone in the lengthy history of AI. Foundation models exhibit five distinct
characteristics:
• They are trained on vast amounts of unlabeled data: GPT-3, developed by OpenAI, is a well-known FM trained
on 499 billion tokens of text, encompassing web crawling, Reddit, and Wikipedia data, which equates to roughly 375
billion words.
• They are sizable: GPT-3 comprises 175 billion parameters, where parameters are adjustable weights used to determine a
model’s output. In comparison, linear regression models have two parameters, while GPT-3 possesses nearly 100 billion
times that amount.
• They are self-supervised: Self-supervised learning is a methodology in which a model is trained to recognize specific
data patterns autonomously, without annotated or labeled datasets, akin to how children acquire language with minimal
instruction.
• They are general: FMs are applicable across multiple tasks and do not require explicit training for a single task. They can
scale their performance without restarting for each task. For instance, GPT-3 can be employed for various tasks such as
question-answering, language translation, sentiment analysis, and more, without retraining.
• They are generative: FMs can generate novel ideas or content, particularly useful in knowledge tasks.
In this discussion, we will examine BERT. Its most attractive features are affordability and accessibility, as it can be down-
loaded and used without any cost. BERT models can be employed to extract high-quality linguistic features from text data or
fine-tuned for specific tasks such as question-answering, abstract summarization, sentence prediction, conversational
response generation, and using users’ data to produce state-of-the-art predictions. Language modeling, at its core, focuses
on predicting missing words within a given context. The primary objective of language models is to complete sentences by
estimating the probability of a word filling a blank space. For example, in the sentence, “Laura and Elsa traveled to Mont-
pellier and purchased a _____ of shoes,” a language model might estimate an 80% probability for the word “pair” and a 20%
probability for the word “cart.”
Before BERT, language models would analyze text sequences during training in a unidirectional manner, either from left
to right or by combining both left-to-right and right-to-left perspectives. This one-directional approach is effective for gen-
erating sentences, as the model can predict subsequent words and progressively construct a complete sentence.
BERT, however, introduces a bidirectional training approach, which is its key technical innovation. This bidirectionality
allows BERT to have a more profound understanding of language context and flow compared to single-direction language
models.
Instead of predicting the next word in a sequence, BERT employs a groundbreaking technique known as Masked Lan-
guage Modeling (MLM). This technique involves randomly masking words within a sentence and then predicting them. By
masking words, BERT is able to consider both the left and right contexts of a sentence when predicting a masked word. This
distinguishes BERT from previous language models such as LSTM-based models, which lacked the simultaneous consid-
eration of both preceding and following tokens. It may be more precise to describe BERT as non-directional rather than
bidirectional.
BERT’s bidirectional training is grounded in the Transformer architecture, which leverages self-attention mechanisms to
process input sequences in parallel, rather than sequentially. This allows BERT to efficiently capture long-range dependen-
cies and complex relationships between words, providing a richer understanding of the sentence’s meaning. By employing
this advanced mathematical framework, BERT has achieved state-of-the-art results on various NLP tasks, revolutionizing
the field of NLP. BERT relies on the Transformer model architecture, as opposed to LSTMs. A Transformer operates by
executing a fixed, minimal number of steps. In each step, it applies an attention mechanism to identify relationships among
all words in a sentence, independent of their respective positions. For example, consider the sentence, “The bat flew swiftly
in the dim cave.” To comprehend that “bat” denotes an animal rather than a piece of baseball equipment, the Transformer
can promptly focus on the words “flew” and “cave,” and deduce the correct meaning in just one step.
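To make the attention idea concrete, here is a tiny NumPy sketch of scaled dot-product self-attention, the core operation of a Transformer layer; it is a simplification (a single head with randomly initialized projection matrices), not BERT's actual implementation:

import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # X: (sequence_length, d_model); Wq/Wk/Wv: projection matrices for queries, keys, values
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # every token attends to every other token
    weights = softmax(scores, axis=-1)        # attention weights per token pair
    return weights @ V                        # context-aware representation of each token

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 16))                  # 6 tokens, 16-dimensional embeddings
Wq, Wk, Wv = (rng.normal(size=(16, 16)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)    # (6, 16)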
Upon searching “BERT” online, numerous variations can be found. BERT was originally designed for the English language
in two specific model dimensions: (i) BERT-BASE, featuring 12 encoders with 12 bidirectional self-attention heads, resulting
in 110 million parameters, and (ii) BERT-LARGE, comprising 24 encoders with 16 bidirectional self-attention heads, which
accumulates to 340 million parameters. Both versions were pre-trained utilizing the Toronto BookCorpus (800 million
words) and English Wikipedia (2500 million words).
BERT is grounded in a Transformer architecture, an attention mechanism that discerns contextual relationships among
words in a text. A fundamental Transformer comprises an encoder, which processes the text input, and a decoder, which
yields a prediction for a given task. BERT’s objective is to establish a language representation model, so it solely requires the
encoder component. The encoder accepts a sequence of tokens as input, which are initially converted into vectors and sub-
sequently analyzed within the neural network. In preparation for processing, the input is enriched with the addition of
supplementary metadata. Token embeddings are utilized by incorporating a [CLS] token at the beginning of the first sen-
tence and appending a [SEP] token at the end of each sentence. Segment embeddings are employed by marking each token
with either Sentence A or Sentence B, allowing the encoder to distinguish between sentences. Lastly, positional embeddings
are assigned to every token to indicate its position within the sentence.
In essence, the Transformer constructs layers that map sequences to sequences, resulting in an output sequence of
vectors that correspond on a 1 : 1 basis with input and output tokens at identical indices. As previously noted, BERT does
not aim to predict subsequent words in a sentence.
Input:                [CLS]   My    daughter   is    smart   [SEP]   She    has    good    grades   [SEP]
Token embeddings:     E[CLS]  EMy   Edaughter  Eis   Esmart  E[SEP]  EShe   Ehas   Egood   Egrades  E[SEP]
Segment embeddings:   EA      EA    EA         EA    EA      EA      EB     EB     EB      EB       EB
Position embeddings:  E0      E1    E2         E3    E4      E5      E6     E7     E8      E9       E10

The input representation of each token is the sum of its token, segment, and position embeddings.
The training process incorporates a duo of primary methodologies. First, the Masked Language Model (MLM) strat-
egy is employed. It is founded on the idea of obfuscating 15% of input words by substituting them with a [MASK] token.
The entire sequence is then channeled through BERT’s attention-based encoder, with predictions made solely for the
masked words based on the context provided by the remaining non-masked words in the sequence. However, this ele-
mentary masking approach encounters a limitation: the model only attempts predictions when the [MASK] token is
present in the input, whereas the objective is to predict the accurate tokens irrespective of the input token. To resolve
this, the 15% of tokens chosen for masking undergo a series of adjustments. Eighty percent of these tokens are replaced
with the [MASK] token, 10% are exchanged with a random token, and the remaining 10% are left unchanged. During
training, the BERT loss function concentrates solely on the prediction of masked tokens, disregarding the predictions of
non-masked tokens. Consequently, the model converges at a notably slower pace compared to left-to-right or right-to-
left models.
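The 80/10/10 rule can be sketched as a simple preprocessing step. The snippet below illustrates only the selection logic on a toy token list; the vocabulary and helper name are illustrative assumptions, not BERT's actual WordPiece pipeline:

import random

def mask_tokens(tokens, vocab, mask_rate=0.15):
    # Return the corrupted sequence and the positions the loss is computed on
    masked, targets = list(tokens), []
    for i, tok in enumerate(tokens):
        if random.random() < mask_rate:           # choose roughly 15% of positions
            targets.append((i, tok))              # loss is computed only on these positions
            r = random.random()
            if r < 0.8:
                masked[i] = '[MASK]'              # 80%: replace with the [MASK] token
            elif r < 0.9:
                masked[i] = random.choice(vocab)  # 10%: replace with a random token
            # remaining 10%: keep the original token unchanged
    return masked, targets

vocab = ['my', 'daughter', 'is', 'smart', 'she', 'has', 'good', 'grades']
print(mask_tokens(['my', 'daughter', 'is', 'smart'], vocab))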
Second, the Next Sentence Prediction (NSP) technique is utilized to understand the relationship between two sentences,
which proves beneficial for tasks such as question answering. In the training phase, the model is presented with pairs of
sentences, learning to predict whether the second sentence is a continuation of the first in the original text. BERT employs a
distinctive [SEP] token to delineate sentences. While training, the model is provided with two input sentences simultane-
ously, such that 50% of the time, the second sentence follows the first one directly, and 50% of the time, it is a random sen-
tence from the entire corpus. BERT is then required to predict whether the second sentence is random or not, presuming
that the random sentence is disconnected from the first. In order to ascertain whether the second sentence is linked to the
first, the complete input sequence is processed through the Transformer-based model. The output of the [CLS] token is
converted into a 2 × 1 shaped vector using a rudimentary classification layer, and the IsNext-Label is assigned via softmax.
The model is trained by combining both the MLM and NSP methodologies to minimize the joint loss function of the two
approaches.
4.5 Installing and Training BERT for Binary Text Classification Using TensorFlow
• Clone the BERT GitHub repository onto your computer. In your terminal, enter the following command: git clone https://github.com/google-research/bert.git
• Obtain the pre-trained BERT model files from the official BERT GitHub page. These files contain the weights, hyperpara-
meters, and other essential information BERT acquired during pre-training. Save these files in the directory where you
cloned the GitHub repository and then extract them. The following links provide access to different files:
– BERT-Large, Uncased (Whole Word Masking): 24-layer, 1024-hidden, 16-heads, 340M parameters: https://storage.googleapis.com/bert_models/2019_05_30/wwm_uncased_L-24_H-1024_A-16.zip
– BERT-Large, Cased (Whole Word Masking): 24-layer, 1024-hidden, 16-heads, 340M parameters: https://storage.googleapis.com/bert_models/2019_05_30/wwm_cased_L-24_H-1024_A-16.zip
– BERT-Base, Multilingual Cased: 104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters: https://storage.googleapis.com/bert_models/2018_11_23/multi_cased_L-12_H-768_A-12.zip
– BERT-Base, Multilingual (not recommended, use Multilingual Cased instead): 102 languages, 12-layer, 768-hidden, 12-heads, 110M parameters: https://storage.googleapis.com/bert_models/2018_11_03/multilingual_L-12_H-768_A-12.zip
More can be found on GitHub: https://github.com/google-research/bert.git.
We observe that there are files designated as “cased” and “uncased,” indicating whether letter casing is deemed beneficial
for the task in question. In our example, we opted to download the BERT-Base-Cased model.
To utilize BERT, it is necessary to transform our data into the format BERT anticipates. BERT requires data to be in a TSV file with a specific four-column structure:
• Column 0: a unique identifier for each row.
• Column 1: an integer label for the row (class labels: 0, 1, 2, 3, etc.).
• Column 2: a consistent letter for all rows, included solely because BERT expects it, though it serves no purpose.
• Column 3: the text samples we aim to classify.
In the subsequent analysis, we will interact with the Yelp Reviews Polarity dataset. By leveraging the pandas library, we
will import and meticulously analyze this information. The dataset comprises user-generated assessments and ratings for a
wide variety of businesses, primarily centered on restaurants and local services, as featured on Yelp’s platform. This aggre-
gation of data provides crucial insights into consumer preferences, experiences, and viewpoints, thus facilitating businesses
and researchers in deciphering customer behavior and enhancing their offerings. The dataset can be accessed at https://
www.tensorflow.org/datasets/catalog/yelp_polarity_reviews. Designed for binary sentiment classification, the Yelp
Reviews Polarity dataset includes 560,000 highly polarized Yelp reviews for training and an additional 38,000 for testing.
This dataset originates from Yelp reviews and constitutes a portion of the Yelp Dataset Challenge 2015 data. For additional
details, please visit http://www.yelp.com/dataset. We can find the Jupyter Notebook at https://github.com/xaviervasques/
hephaistos/blob/main/Notebooks/BERT.ipynb.
As delineated above, it is necessary to create a folder within the directory where BERT was cloned, which will house three
distinct files: train.tsv, dev.tsv, and test.tsv (where TSV denotes tab-separated values). Both train.tsv and dev.tsv should
encompass all four columns, while test.tsv ought to contain only two columns, specifically the row ID and the text desig-
nated for classification.
Additionally, we should create a folder named “data” within the “bert” directory to store the .tsv files and another folder
called “bert_output” where the fine-tuned model will be saved. The pre-trained BERT model should be stored in the “bert”
directory as well.
The “bert” folder looks like this:
# Import pandas
import pandas as pd

# Use pandas read_csv function to load the Yelp Reviews Polarity dataset into a DataFrame
# The dataset is stored in the files 'train.csv' and 'test.csv' with comma-separated values
# Assign column names 'label' and 'text' to the respective columns in the DataFrames
df_bert_train = pd.read_csv('../data/datasets/yelp_review_polarity_csv/train.csv', names=['label', 'text'])
df_bert_test = pd.read_csv('../data/datasets/yelp_review_polarity_csv/test.csv', names=['label', 'text'])
# Display the first five rows of the training DataFrame to verify the data import
df_bert_train.head()
Output:
(The first five rows of df_bert_train are displayed, with columns 'label' and 'text'.)
Input:
# Create a LabelEncoder object to convert the original labels into integer-encoded labels
from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()

# Use the LabelEncoder object to fit and transform the 'label' column in each DataFrame
df_bert_train['label'] = labelencoder.fit_transform(df_bert_train['label'])
df_bert_test['label'] = labelencoder.fit_transform(df_bert_test['label'])
# Show the first five rows of the DataFrame, displaying the transformed 'label' column
df_bert_train.head()
Output:
(The first five rows are displayed again; the 'label' column now contains the integer-encoded values.)
Input:
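# A possible reconstruction (assumption): reshape the training DataFrame into the
# four-column structure (id, label, alpha, text) described above
df_bert_train = pd.DataFrame({
    'id': range(len(df_bert_train)),
    'label': df_bert_train['label'],
    'alpha': ['a'] * df_bert_train.shape[0],   # throwaway column BERT expects
    'text': df_bert_train['text'].replace(r'\n', ' ', regex=True)
})
df_bert_train.head()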
Output:
id label alpha text
Input:
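# A possible reconstruction (assumption): apply the same four-column formatting to the test DataFrame
df_bert_test = pd.DataFrame({
    'id': range(len(df_bert_test)),
    'label': df_bert_test['label'],
    'alpha': ['a'] * df_bert_test.shape[0],
    'text': df_bert_test['text'].replace(r'\n', ' ', regex=True)
})
df_bert_test.head()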
Output:
id label alpha text
Input:
# Split the train set further into train and dev (development/validation) sets
from sklearn.model_selection import train_test_split
df_bert_train, df_bert_dev = train_test_split(df_bert_train, test_size=0.01)

# Display the first five rows of each set (train, test, and dev) by calling the head() method
df_bert_train.head(), df_bert_test.head(), df_bert_dev.head()
Output:
Input:
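# A minimal sketch (assumption): write the three tab-separated files into the 'data' folder
# created inside the cloned bert directory (adjust the paths to your own layout)
df_bert_train.to_csv('data/train.tsv', sep='\t', index=False, header=False)
df_bert_dev.to_csv('data/dev.tsv', sep='\t', index=False, header=False)
# test.tsv keeps only the row id and the text to classify
df_bert_test[['id', 'text']].to_csv('data/test.tsv', sep='\t', index=False, header=True)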
The training process of the model may require some time, depending on our computer's capabilities and the parameters we have chosen. While training, progress messages are displayed in the terminal. Once training is complete, we can point an environment variable to the fine-tuned checkpoint saved in bert_output:
export TRAINED_MODEL_CKPT=./bert_output/model.ckpt-413
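As an illustration, predictions on test.tsv can then be generated with the repository's run_classifier.py script. The paths below are placeholders for the extracted pre-trained model, and the exact flags should be checked against the BERT repository's README:

python run_classifier.py --task_name=cola --do_predict=true --data_dir=./data \
  --vocab_file=./cased_L-12_H-768_A-12/vocab.txt \
  --bert_config_file=./cased_L-12_H-768_A-12/bert_config.json \
  --init_checkpoint=$TRAINED_MODEL_CKPT --max_seq_length=128 --output_dir=./bert_output/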
Upon execution, a file named test_results.tsv will be produced, containing a number of columns equivalent to the total
count of class labels.
For a comprehensive insight into BERT, it is highly recommended to consult the initial research paper, Devlin et al.
(2019), as well as the accompanying open-source GitHub repository (https://github.com/google-research/bert). Further-
more, an alternative implementation of BERT can be found within the PyTorch framework. We can establish a virtual envi-
ronment containing the necessary packages. We may use any package or environment manager. For example, Conda can be
utilized.
We can also install the PyTorch variant of BERT provided by Hugging Face.
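For instance (the environment name and version pins are arbitrary choices):

conda create -n bert_pytorch python=3.9
conda activate bert_pytorch
pip install torch transformers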
4.6 Utilizing BERT for Text Summarization
Text summarization refers to the computational technique of condensing a large body of data into a concise subset
(a summary) that encapsulates the most crucial and pertinent information found in the original content. In this
example, we will use Bert Extractive Summarizer (https://pypi.org/project/bert-extractive-summarizer/) that we need
to install:
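pip install bert-extractive-summarizer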
The Hugging Face PyTorch transformers library is utilized by this tool to execute extractive summarizations by embedding
sentences and subsequently implementing a clustering algorithm. This technique pinpoints sentences in close proximity to
the cluster centroids, which are considered the most representative. To offer supplementary context, the library integrates
coreference resolution methodologies using the neuralcoref library, accessible at https://github.com/huggingface/neural-
coref. The CoreferenceHandler class facilitates the modification of the neuralcoref library’s greediness level, allowing for the
customization of the coreference resolution approach. In the most recent version of the bert-extractive-summarizer, if a
GPU is present, CUDA is employed by default to ensure optimal computational performance.
Let us summarize the following text that we can find in Wikipedia (https://en.wikipedia.org/wiki/Machine_learning):
Learning algorithms work on the basis that strategies, algorithms, and inferences that worked well in the past are likely to
continue working well in the future. These inferences can sometimes be obvious, such as ‘since the sun rose every morning
for the last 10,000 days, it will probably rise tomorrow morning as well.’ Other times, they can be more nuanced, such as ‘X
% of families have geographically separate species with color variants, so there is a Y% chance that undiscovered black
swans exist’.
Machine learning programs can perform tasks without being explicitly programmed to do so. It involves computers learning
from data provided so that they carry out certain tasks. For simple tasks assigned to computers, it is possible to program algo-
rithms telling the machine how to execute all steps required to solve the problem at hand; on the computer’s part, no learning is
needed. For more advanced tasks, it can be challenging for a human to manually create the needed algorithms. In practice, it
can turn out to be more effective to help the machine develop its own algorithm, rather than having human programmers specify
every needed step.
The discipline of machine learning employs various approaches to teach computers to accomplish tasks where no fully sat-
isfactory algorithm is available. In cases where vast numbers of potential answers exist, one approach is to label some of the
correct answers as valid. This can then be used as training data for the computer to improve the algorithm(s) it uses to determine
correct answers. For example, to train a system for the task of digital character recognition, the MNIST dataset of handwritten
digits has often been used.”
Input:
# Store the Wikipedia passage quoted above in the 'body' variable
body = '''...'''  # paste the full text quoted above here
# Instantiate the extractive summarizer (from the bert-extractive-summarizer package)
from summarizer import Summarizer
model = Summarizer()
# Call the summarizer model on the 'body' text with a specified minimum summary length (in this case, 60 characters)
result = model(body, min_length=60)
# Join the resulting summarized text and store it in the 'full' variable
full = ''.join(result)
print(full)
Output:
Learning algorithms work on the basis that strategies, algorithms, and inferences that worked
well in the past are likely to continue working well in the future. For more advanced tasks, it
can be challenging for a human to manually create the needed algorithms. For example, to train a
system for the task of digital character recognition, the MNIST dataset of handwritten digits
has often been used.
A variety of machine learning techniques have been developed to tackle the problem of question answering through NLP.
Previously, the bag-of-words method was prevalent, centering on responding to pre-established queries designed by devel-
opers. This approach necessitated considerable effort on the part of developers to generate questions and their respective
answers. Although beneficial for chatbots, the bag-of-words method faced difficulties in managing inquiries related to
extensive databases. Presently, the domain of NLP is predominantly influenced by transformer-based models such as BERT,
which have notably enhanced the field’s capabilities.
Let us take an example.
First, we install transformers through our terminal:
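pip install transformers torch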
Input:
import torch
from transformers import BertForQuestionAnswering, BertTokenizer
# Load a BERT model fine-tuned on SQuAD and its tokenizer (the checkpoint name is an assumed example)
model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')
tokenizer = BertTokenizer.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')
# Hypothetical question and shortened passage; the original run used the longer passage quoted earlier
question = "What does machine learning involve?"
text = ("Machine learning programs can perform tasks without being explicitly programmed to do so. "
        "It involves computers learning from data provided so that they carry out certain tasks.")
# Encode the question and the reference text together
encoding = tokenizer.encode_plus(question, text)
# Extract the input IDs (token embeddings) and token type IDs (segment embeddings)
inputs = encoding['input_ids']
sentence_embedding = encoding['token_type_ids']
tokens = tokenizer.convert_ids_to_tokens(inputs)
# Obtain the start and end scores for the answer span from the BERT model
start_scores, end_scores = model(input_ids=torch.tensor([inputs]),
                                 token_type_ids=torch.tensor([sentence_embedding]), return_dict=False)
# Keep the tokens between the highest-scoring start and end positions
start_index = torch.argmax(start_scores)
end_index = torch.argmax(end_scores)
answer = ' '.join(tokens[start_index:end_index + 1])
# Iterate through each word in the answer and correct subword tokens
corrected_answer = ''
for word in answer.split():
    # If it's a subword token, remove the '##' prefix and append to the corrected answer
    if word[0:2] == '##':
        corrected_answer += word[2:]
    # Otherwise, add a space before the word and append to the corrected answer
    else:
        corrected_answer += ' ' + word
print(corrected_answer)
Output:
computers learning from data provided so that they carry out certain tasks . for simple tasks
assigned to computers , it is possible to program algorithms telling the machine how to execute
all steps required to solve the problem at hand ; on the computer ' s part , no learning is needed .
for more advanced tasks , it can be challenging for a human to manually create the needed
algorithms . in practice , it can turn out to be more effective to help the machine develop its own
algorithm , rather than having human programmers specify every needed step . the discipline of
machine learning employs various approaches to teach computers to accomplish tasks where no
fully satisfactory algorithm is available
Further Reading
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). BERT: pre-training of deep bidirectional transformers for language
understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational
Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186. Minneapolis, Minnesota:
Association for Computational Linguistics.
Miller, D. (2019). Leveraging BERT for extractive text summarization on lectures. Computer Science, arXiv, https://doi.org/
10.48550/arXiv.1906.04165.
https://www.nltk.org.
https://spacy.io.
https://www.kaggle.com/code/ranjitmishra/sms-spam-collection-natural-language-processing.
https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection.
http://www.grumbletext.co.uk.
http://www.esp.uem.es/jmgomez/smsspamcorpus/
https://www.kaggle.com/code/drvaibhavkumar/twitter-data-analysis-using-tweepy.
https://research.ibm.com/blog?tag=foundation-models.
https://medium.com/@samia.khalid/bert-explained-a-complete-guide-with-theory-and-tutorial-3ac9ebc8fa7c.
https://en.wikipedia.org/wiki/BERT_(language_model).
https://github.com/google-research/bert.
https://medium.com/swlh/a-simple-guide-on-using-bert-for-text-classification-bbf041ac8d04.
https://pypi.org/project/bert-extractive-summarizer/
5 Machine Learning Algorithms in Quantum Computing
The idea of designing and developing quantum computers first emerged in the early 1980s under the impetus of physicist
and Nobel Prize winner Richard Feynman. While a “classical” computer operates with bits of values 0 or 1, a quantum
computer uses the fundamental properties of quantum physics and relies on “quantum bits (qubits).” Beyond this
technological feat, quantum computing opens the way to processing computational tasks whose complexity is beyond
the reach of our current computers.
At the beginning of the twentieth century, the theories of so-called classical physics were unable to explain certain
problems observed by physicists. Therefore, they had to be reformulated and enriched. With the impetus of scientists,
physics evolved in the first place toward a “new mechanics” that would become “wave mechanics” and finally “quantum
mechanics.”
Quantum mechanics is the mathematical and physical theory that describes the fundamental structure of matter and
the evolution in time and space of phenomena on a microscopic scale. An essential notion of quantum mechanics is
the “wave–particle duality.” Until the 1890s, physicists considered that the world was composed of two types of objects
or particles: those that have mass (such as electrons, protons, neutrons, and atoms) and those that do not (such as photons,
waves, etc.). For the physicists of the time, these particles were governed by the laws of Newtonian mechanics for those that
have mass and by the laws of electromagnetism for waves. Therefore, we had two theories of “physics” to describe two
different types of objects.
Quantum mechanics invalidated this dichotomy and introduced the fundamental idea of “wave–particle duality”: a
particle of matter or a wave must be treated by the same laws of physics. This was the advent of wave mechanics, which
would become quantum mechanics a few years later. The development of quantum mechanics was associated with great
names such as Niels Bohr, Paul Dirac, Albert Einstein, Werner Heisenberg, Max Planck, Erwin Schrödinger, and many
others.
Max Planck and Albert Einstein, by studying the radiation emitted by a heated body and the photoelectric effect, were the
first to understand that exchanges of light energy could only occur in “packets” and not with any value. This is similar to a
staircase that only allows you to go up by the height of one step (or more) but not to reach any height between two steps.
Moreover, it was for his theory on the quantized aspect of energy exchanges that Albert Einstein was awarded the 1921 Nobel Prize in Physics, not for the theory of relativity.
Niels Bohr extended the quantum postulates of Planck and Einstein from light to matter, proposing a model that repro-
duced the spectrum of the hydrogen atom. He was awarded the Nobel Prize in Physics in 1922, defining a model of the atom
that dictates the behavior of light quanta. Step by step, rules were found to calculate the properties of atoms, molecules, and
their interactions with light. From 1925 to 1927, a series of works by several physicists and mathematicians led to two gen-
eral theories applicable to these problems: Louis de Broglie’s wave mechanics and especially Erwin Schrödinger’s wave
mechanics, and Werner Heisenberg, Max Born, and Pascual Jordan’s matrix mechanics. These two mechanics were unified
by Erwin Schrödinger from a physical point of view and by John von Neumann from a mathematical point of view. Finally,
Paul Dirac formulated the complete synthesis or rather generalization of these two mechanics, which is now called quantum
mechanics. The fundamental equation of quantum mechanics is the Schrödinger equation:
$$H(t)\,\psi(t) = i\hbar \frac{d}{dt}\psi(t)$$
Prior to delving into the development of a “Quantum Information Theory,” let us revisit the operation of a standard
computer and the “Classical Information Theory” that governs the functioning of computers as we know them today.
The first binary computers were built in the 1940s, including Colossus (1943) and ENIAC (1945). Colossus was designed to decipher secret German messages, while ENIAC was created for calculating ballistic trajectories. In 1945,
ENIAC (an acronym for Electronic Numerical Integrator and Computer) became the first entirely electronic computer
constructed to be “Turing-complete,” meaning it could be reprogrammed to solve any computational problem, in principle.
ENIAC was programmed by women, referred to as the “ENIAC women,” with the most renowned being Kay McNulty, Betty
Jennings, Betty Holberton, Marlyn Wescoff, Frances Bilas, and Ruth Teitelbaum. Prior to this, these women had performed
ballistic calculations using mechanical desktop computers for the military. ENIAC weighed 30 tons, occupied an area of
72 m2, and consumed 140 kW.
Regardless of the task performed by a computer, the underlying process remains consistent: an instance of the task is
described by an algorithm that is translated into a series of 0s and 1s, which is then executed in the computer’s processor,
memory, and input/output devices. This is the foundation of binary calculation, which practically relies on electrical
circuits equipped with transistors that can be in two modes: “ON” allowing current to flow and “OFF” preventing current
flow.
Over the past 80 years, a classical information theory has been developed based on these 0s and 1s, constructed from Boolean operators (AND, OR, XOR), words (bytes), and simple arithmetic operations such as "0 + 0 = 0, 0 + 1 = 1 + 0 = 1, 1 + 1 = 0 (with a carry)," and verifying whether 1 = 1, 0 = 0, and 1 ≠ 0. Naturally, more complex operations can be constructed
from these basic operations, allowing computers to perform trillions of such operations per second.
These 0s and 1s are contained in “BInary digiTs” or “bits,” which represent the smallest quantity of information within a
computer system.
For a quantum computer, the “qubit (quantum bit)” serves as the fundamental entity, representing the smallest unit
capable of manipulating information, analogous to the “bit.” A qubit possesses key properties of quantum mechanics such
as superposition, entanglement, measurement, or inference.
Superposition is about creating a quantum state that is a combination of | 0> and | 1>. A quantum object (at the
microscopic scale) can exist in an infinite number of states (as long as its state is not measured). Consequently, a qubit
can exist in any state between 0 and 1. Qubits can simultaneously assume the value 0 and 1, or more accurately, “a
certain amount of 0 and a certain amount of 1,” as a linear combination of two states denoted | 0> and | 1>, with
coefficients α and β. Hence, while a classical bit can only represent two states (0 or 1), a qubit can represent an “infinity”
of states. This is one of the potential advantages of quantum computing from an information theory perspective (n qubits span a 2^n-dimensional quantum state space).
(Bloch sphere representation of a qubit state |ψ>, with the poles of the sphere corresponding to |0> and |1>.)
The vectors |0> and |1> are situated in a two-dimensional complex vector space, known as C²:
$$|0\rangle = \begin{pmatrix} 1 \\ 0 \end{pmatrix} \qquad \text{and} \qquad |1\rangle = \begin{pmatrix} 0 \\ 1 \end{pmatrix}$$
As a result, we can represent any vector in C² using the following formula:
$$a|0\rangle + b|1\rangle$$
The terms | 0> and | 1> are pronounced as “ket zero” and “ket one,” correspondingly, and are collectively recognized as
the computational basis.
Superposition involves generating a quantum state that constitutes a fusion of |0> and |1>:
$$a|0\rangle + b|1\rangle$$
in which a and b denote complex numbers satisfying:
$$|a|^2 + |b|^2 = 1$$
Two quantum states are considered identical if they diverge solely by a constant multiple, designated as u, where |u| = 1. This is attributed to the fact that
$$|a|^2 + |b|^2 = |au|^2 + |bu|^2 = 1$$
Under these constraints, we can represent the qubit on the Bloch Sphere. Observe that when both a and b are non-zero,
the qubit’s state comprises both | 0> and | 1>. This concept is what is commonly referred to when people mention that a
qubit can simultaneously exist as “0 and 1.”
The act of measuring forces the qubit's state, which is a|0> + b|1>, to collapse into either |0> or |1> upon observation. In this scenario, |a|² represents the likelihood of obtaining |0> upon measurement and |b|² represents the likelihood of obtaining |1> upon measurement.
For instance:
$$\frac{\sqrt{2}}{2}|0\rangle + \frac{\sqrt{2}}{2}|1\rangle$$
possesses an equal probability of collapsing into either |0> or |1>.
Another example:
$$\frac{\sqrt{3}}{2}|0\rangle - \frac{i}{2}|1\rangle$$
has a 75% chance of collapsing into |0>.
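As a quick check, these measurement probabilities can be reproduced with Qiskit's Statevector class (an illustrative sketch, assuming the Qiskit library installed later in this chapter):

import numpy as np
from qiskit.quantum_info import Statevector

# The state (sqrt(3)/2)|0> - (i/2)|1> from the example above
state = Statevector([np.sqrt(3) / 2, -1j / 2])
print(state.probabilities())   # [0.75, 0.25]: 75% chance of measuring |0>, 25% of |1>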
Consider now a two-qubit state a|00> + b|01> + c|10> + d|11>. If two or more of the values a, b, c, and d are non-zero, and it is not possible to separate the qubits, they become entangled, exhibiting perfect correlation and losing their independence.
For example,
$$\frac{\sqrt{2}}{2}|00\rangle + \frac{\sqrt{2}}{2}|01\rangle \qquad \text{(not entangled)}$$
$$\frac{\sqrt{2}}{2}|01\rangle - \frac{\sqrt{2}}{2}|10\rangle, \qquad \frac{\sqrt{2}}{2}|00\rangle + \frac{\sqrt{2}}{2}|11\rangle \qquad \text{(entangled)}$$
We can express
$$\frac{\sqrt{2}}{2}|00\rangle + \frac{\sqrt{2}}{2}|01\rangle$$
as
$$|0\rangle \otimes \left( \frac{\sqrt{2}}{2}|0\rangle + \frac{\sqrt{2}}{2}|1\rangle \right)$$
but it is impossible to represent
$$\frac{\sqrt{2}}{2}|00\rangle + \frac{\sqrt{2}}{2}|11\rangle$$
as a combination of two individual qubit states, which means they are entangled. Upon measuring the initial qubit, the second qubit becomes distinctly defined.
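The same kind of check can be made for two-qubit states (an illustrative sketch; amplitudes are given in the order of the basis states 00, 01, 10, 11):

import numpy as np
from qiskit.quantum_info import Statevector

# Separable state (|00> + |01>)/sqrt(2): it factors into individual qubit states
separable = Statevector(np.array([1, 1, 0, 0]) / np.sqrt(2))
print(separable.probabilities())   # [0.5, 0.5, 0.0, 0.0]

# Entangled state (|00> + |11>)/sqrt(2): only the outcomes 00 and 11 can ever be observed
entangled = Statevector(np.array([1, 0, 0, 1]) / np.sqrt(2))
print(entangled.probabilities())   # [0.5, 0.0, 0.0, 0.5]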
Quantum gates are the fundamental building blocks of quantum computing, analogous to classical logic gates in classical
computing. They are used to manipulate quantum bits, or qubits, by applying specific transformations to their quantum
states. Quantum gates operate on the principles of quantum mechanics, such as superposition and entanglement, allowing
for more complex and powerful computations than their classical counterparts.
Some examples of common quantum gates are:
1) Pauli-X gate: This is the quantum equivalent of the classical NOT gate. It flips the state of a qubit, transforming | 0> to |
1> and | 1> to | 0>.
2) Hadamard gate: This gate is used to create superposition in a qubit. When applied to a qubit in either the | 0> or | 1>
state, it generates an equal superposition of both states, resulting in ( | 0> + | 1>)/√2 or ( | 0> − | 1>)/√2, respectively.
3) Pauli-Y gate: Similar to the Pauli-X gate, the Pauli-Y gate flips the qubit states but also adds a complex phase. The trans-
formation maps | 0> to i | 1> and | 1> to -i | 0>, where i is the imaginary unit.
4) Pauli-Z gate: The Pauli-Z gate is a phase-flip gate that adds a phase of π to the | 1> state without affecting the | 0> state.
It maps | 0> to | 0> and | 1> to − | 1>.
5) CNOT gate (Controlled-NOT): This is a two-qubit gate where the first qubit acts as the control, and the second qubit is
the target. If the control qubit is in state | 1>, the target qubit’s state is flipped. If the control qubit is in state | 0>, the
target qubit remains unchanged.
These quantum gates, among others, can be combined and applied to qubits to perform various quantum computing tasks
and algorithms.
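These gates can be tried out directly with Qiskit's circuit and state-vector utilities (a brief illustrative sketch; the library itself is installed later in this chapter):

from qiskit import QuantumCircuit
from qiskit.quantum_info import Statevector

# Pauli-X flips |0> to |1>
qc = QuantumCircuit(1)
qc.x(0)
print(Statevector.from_instruction(qc))   # amplitudes [0, 1]

# Hadamard creates (|0> + |1>)/sqrt(2), then Pauli-Z flips the phase of the |1> component
qc = QuantumCircuit(1)
qc.h(0)
qc.z(0)
print(Statevector.from_instruction(qc))   # amplitudes approximately [0.707, -0.707]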
In quantum computing, the term “inference” generally refers to the process of extracting useful information or making
predictions based on the manipulation and measurement of qubits in a quantum system. This concept is particularly rel-
evant in the context of quantum machine learning, where quantum algorithms are used to analyze and process data to gain
insights, make predictions, or classify new data points. Quantum inference leverages the principles of quantum mechanics
such as superposition and entanglement to enable more efficient data processing and potentially faster solutions than clas-
sical methods. Quantum algorithms such as Grover’s search algorithm and quantum phase estimation can be applied to
perform inference tasks, which may provide speedup over their classical counterparts.
Machine learning is changing the way businesses operate in fundamental ways and bringing new opportunities for progress,
alongside challenges. The capabilities of machine learning to interpret and analyze data have greatly increased. Yet,
machine learning is also demanding in terms of computing power because of more and more data to process and the com-
plexity of workflows. Machine learning and quantum computing are two technologies that can potentially allow us to solve
complex problems, previously untenable, and accelerate areas such as model training or pattern recognition. The future of
computing will certainly be made up of classical, biologically inspired, and quantum computing. The intersection between
quantum computing and machine learning has received considerable attention in recent years and has allowed the devel-
opment of quantum machine learning algorithms such as quantum-enhanced support vector machine (QSVM), QSVM mul-
ticlass classification, variational quantum classifier (VQC), or quantum generative adversarial networks (qGANs). Since the
birth of quantum computing, scientists have been searching for the best places to apply quantum algorithms. The first two
quantum algorithms published by Shor (1994) and Grover (1996) demonstrated that applying them to factorization and
theoretical searching could produce an advantage in comparison with classical computing. The study of machine learning
problems with quantum techniques is a new and active area of research. For example, a quantum neural network (QNN) had been discussed in general terms but was only defined at the physical level in 2000 by Ezhov and Ventura (2000). In 2003, an approach was proposed by Ricks and Ventura to train QNNs, but the method was exponentially complex. The introduction of quantum computing into clustering, distributed semantics, and SVMs by Lloyd et al. (2013), Blacoe et al. (2013), and Rebentrost et al. (2014) also remained largely theoretical. The emergence of physical implementations of quantum com-
puters such as those of IBM has made it possible to translate research theory on quantum machine learning algorithms into
practice.
It is clear now that quantum computers have the potential to boost the performance of machine learning algorithms and
may contribute to breakthroughs in different fields such as drug discovery or fraud detection. Data can exhibit structures
that are difficult to identify, reducing classification accuracy. The idea is to find better patterns within machine learning/
deep learning processes by leveraging quantum systems that map data to higher dimensions for training and use. In a recent
paper, Havlíček et al. (2019) propose and describe the experimental implementation of two quantum algorithms on a super-
conducting processor. Both algorithms solve a problem of supervised learning, which is the construction of a classifier. Like
conventional SVMs, the quantum variational classifier uses a variational quantum circuit to classify data and the quantum
kernel estimator estimates the kernel function and optimizes a classical SVM. The reason we are exploring this area is
because when we use classical SVMs, we can be limited if the feature space is very large. In this case, the kernel functions
are computationally expensive to estimate. A key element in the paper is the use of the quantum state space as feature space,
as we can exploit the exponentially large quantum state space through controllable entanglement and interference. This is
in preparation for the quantum advantage, which refers to solving practically relevant challenges better or faster with quan-
tum computing in comparison to classical computers with the best-known hardware and the best-known classical solutions.
It is still an open question whether near-term quantum computers can be advantageous for machine learning tasks. Clearly,
many elements need to be resolved, but there is a path to improve training by considering more dimensions than possible
today and decreasing computational time, for example, by algorithms using kernel methods (linear, exponential, Gaussian,
hyperbolic, angular functions, etc.).
There are also hyperparameters in the machine learning models that need to be addressed, including regularization, batch size, learning rate, and others. For example, part of establishing an artificial neural network is deciding how many layers of
hidden nodes will be used between the input layer and the output layer. Hyperparameter optimization is also an area in
which quantum computing could potentially assist.
Liu et al. (2021) provided more than assumptions by describing the construction of a classification problem and rigorously
proving a quantum speedup. The authors show that no classical learner can classify data inverse-polynomially better than
random guessing. They found mathematical proof of a quantum advantage for machine learning by developing a specific
task for which quantum kernel methods are better than classical methods. What does "better" mean here? The quantum advantage comes from the fact that we can construct a family of datasets for which only quantum computers can recognize
the intrinsic labeling patterns, while for classical computers the dataset looks like random noise. When they visualized the
data in a quantum feature map, the quantum algorithm was able to predict the labels very quickly and accurately. The idea
was to create classification problems based on the discrete logarithm problem: computing logarithms in a cyclic group in which it is possible to generate all the members of the group using a single mathematical operation. This specific problem can be solved with Shor's algorithm on a quantum computer, whereas it would take a superpolynomial amount of time on a classical computer. Of course, we are talking
about a very specific problem in which the classification problem needs to fit into the cyclical structure. In real life, most
quantum algorithms do not perform better than conventional ones run on classical computers; there is room for
improvement.
In the machine learning/deep learning worlds, we can evoke the following potential use cases for quantum computing:
• Financial services: recommendations of finance offers, credit and asset scoring, irregular behaviors (fraud).
• Health care and life science: accelerated diagnosis, genomic analysis, clinical trial enhancements, medical image processing.
The Large Hadron Collider accelerates beams of protons that collide a billion times per second at near-light speed, creat-
ing particles in the proton chaos measured by detectors. It produces 1 petabyte of data per second, which requires a million
classical CPU cores in 170 locations across the world. To be detectable, data from detectors should be processed by complex
computation. In a recent study, CERN and IBM investigated how to use quantum machine learning to detect and analyze
the Higgs boson (a particle that helps understand the origin of mass). Finding occurrences of the Higgs boson within a
maelstrom of data seriously challenges the limits of classical computers. In their study, the scientists showed that a
general-purpose quantum kernel algorithm that had not been optimized for particle physics, contrary to the classical
machine learning algorithms that had been optimized by CERN, was able to match CERN’s best classical algorithms.
The algorithm used was a QSVM.
To better understand quantum machine learning, we will use the open-source SDK Qiskit (https://qiskit.org) that allows
working with quantum computers and provides many algorithms and use cases. Qiskit supports Python version 3.6 or later.
The goal of this chapter is to apply quantum machine learning algorithms to real datasets. To understand the different con-
cepts, we can explore qiskit.org.
To install the environment, we can go to https://qiskit.org/documentation/getting_started.html.
In our environment, we can install Qiskit with pip:
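pip install qiskit
pip install qiskit-machine-learning

The second command is an assumption: the quantum kernel and QSVC classes used later in this chapter are distributed in the separate qiskit-machine-learning package.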
Figure 5.1 Qubit connectivity of the 127-qubit superconducting quantum computer (ibm_sherbrooke).
While studying the Qiskit documentation, you will encounter references to the Qiskit Runtime primitives, which serve as
implementations of the Sampler and Estimator interfaces found in the qiskit.primitives module. These interfaces facilitate
the seamless interchangeability of primitive implementations with minimal code modifications. Various primitive
implementations can be found within the qiskit, qiskit_aer, and qiskit_ibm_runtime libraries, each designed to cater to
specific requirements:
• The primitives in the qiskit library enable efficient local state vector simulations, offering expedient algorithm prototyping
capabilities.
• The primitives in the qiskit_aer library provide access to local Aer simulators, which are instrumental in conducting
simulations involving noise.
• The primitives in the qiskit_ibm_runtime library grant access to cloud simulators and real quantum hardware through
the Qiskit Runtime service. These primitives incorporate exclusive features such as built-in circuit optimization and error
mitigation support.
Primitives constitute core functions that facilitate the construction of modular algorithms and applications, delivering
outcomes that transcend mere count values by providing more immediate and meaningful information. Moreover, they
offer a seamless pathway to harness the latest advancements in IBM Quantum hardware and software.
The initial release of Qiskit Runtime comprises two essential primitives: Sampler, which returns quasi-probability distributions of circuit measurement outcomes, and Estimator, which computes expectation values of observables.
For more comprehensive insights, detailed information is available in the following resource: https://qiskit.org/ecosys-
tem/ibm-runtime/tutorials/how-to-getting-started-with-sampler.html.
Since the beginning of their studies, scientists have been looking for the best applications and output of quantum algo-
rithms. As described above, the intersection between quantum computing and machine learning has received widespread
attention in recent years and has allowed the development of quantum machine learning algorithms such as QSVM, QVC,
and QNN.
Solving supervised machine learning problems with quantum techniques is a new area of research. In classical machine
learning, kernel methods are widely used, and the use of support vector machine (SVM) for classification is one of the most
frequent applications. SVMs have been widely used as binary classifiers and applied in recent years to solving multiclass
problems. In binary SVM, the objective is to create a hyperplane that linearly divides n-dimensional data points into two
components by searching for an optimal margin that correctly segregates the data into different classes. The hyperplane that
divides the input dataset into two groups can be either in the original feature space or in a higher-dimensional kernel space.
The selected optimal hyperplane from among many hyperplanes that might classify the data corresponds to the hyperplane
with the largest margin that allows the largest separation between the different classes. It is an optimization problem under
constraints in which the distance between the nearest data point and the optimal hyperplane (on each side) is maximized.
The hyperplane is then called the maximum-margin hyperplane, allowing creation of a maximum-margin classifier. The
closest data points are known as support vectors, and the margin is an area that generally does not contain any data points. If
the hyperplane defined as optimal is too close to the data points and the margin too small, it will be difficult to predict new
data and the model will fail to generalize well.
Several methods have been proposed to build multiclass SVMs based on binary SVM such as the all-pair approach in
which a binary classification problem for each pair of classes is used. In addition to linear classification, it is also possible
to compute a nonlinear classification using what is commonly called the kernel trick (a kernel function) that maps inputs
into high-dimensional feature spaces. The kernel function corresponds to an inner product of vectors in a potentially high-
dimensional Euclidian space referred to as the feature space. The objective of nonlinear SVM is to gain separation by map-
ping data to higher-dimensional space because many classification or regression problems are not linearly separable or
regressable in the space of the inputs x. The aim is to use the kernel trick to move to a higher-dimensionality feature space
given a suitable mapping x → ϕ(x). To address data not tractable by linear methods, we need to choose a suitable feature
map. By combining classical kernel methods and quantum models, quantum kernel methods can shape new approaches in
machine learning. In early versions of quantum kernel methods, quantum feature maps encode the data points into inner
products or amplitudes in the Hilbert space. The number of features determines the number of qubits, and the quantum
circuit used to implement the feature map is of a length that is a linear or polylogarithmic function of the size of the dataset.
The work provided thus far to document or prove the advantage of a quantum feature map has been performed by carefully
choosing synthetic datasets or by application to small, binary classification problems. Data can exhibit structures that are
difficult to identify; therefore classification accuracy may be reduced.
Our goal is to find better patterns using machine learning processes by leveraging quantum systems that map data to
higher dimensions for training purposes and use. The principle, considering a classical data vector x ∈ χ, is to map x to an n-qubit quantum feature state ϕ(x) with a unitary encoding circuit U(x), such that $\phi(x) = U(x)\,|0^n\rangle\langle 0^n|\,U^\dagger(x)$. For two samples x and x′, the quantum kernel function K is defined as the inner product of the two quantum feature states in the Hilbert–Schmidt space, $K(x, x') = \mathrm{tr}\!\left[\phi^\dagger(x')\,\phi(x)\right]$, translated as the transition amplitude $K(x, x') = \left|\langle 0^n|\,U^\dagger(x')\,U(x)\,|0^n\rangle\right|^2$. For instance, the kernel function can be estimated on a quantum computer by a procedure called quantum kernel estimation (QKE), which consists of evolving the initial state $|0^n\rangle$ with $U^\dagger(x')\,U(x)$ and recording the frequency of the all-zero outcome $0^n$.
The constructed kernel is then injected into a standard SVM. By replacing the classical kernel function with QKE, it
is possible to classify data in a quantum feature space with SVM. Rebentrost et al. (2013) proposed a theoretically feasible
quantum kernel approach based on SVM. In 2019, Havlíček et al. and Schuld and Killoran (2019) proposed two
implementations of quantum kernel methods. The authors experimentally implemented two quantum algorithms into a
superconducting processor. Like a conventional SVM, the quantum variational classifier uses a variational quantum circuit
to classify data, and the quantum kernel estimator estimates the kernel function and optimizes a classical SVM. Classifying
data using quantum algorithms may provide possible advantages such as increased speed for kernel computing compared to
classical computing and a potential improvement in classification accuracy. To achieve these advantages, it is mandatory to
find and explicitly specify a suitable feature map. This operation is not straightforward when compared to classic kernel
specification. Although theoretical work has shown a demonstrable advantage on synthetically generated datasets, concrete
and specific applications are needed to empirically study whether a quantum advantage may be reached and, if so, for what
kinds of datasets and applications. It is now a major challenge to find quantum kernels that could potentially provide
advantages on real-world datasets.
Richard Feynman suggested the use of quantum systems to efficiently simulate nature. Despite over a century of
research on cortical circuits, it is still unclear and not broadly accepted how many classes of cortical neurons exist.
The continuous development of new techniques and the availability of more and more data regarding phenotypes does
not allow the maintenance of a unit classification system that can be simple to update and that may take into consid-
eration all of the different characteristics of neurons. Neuronal classification remains a topic in progress because it is
still unclear how to designate a neuronal cell class and what are the best features to define it. The cortical circuit is
composed of glutamatergic neurons (approximately 80 to 90%) and GABAergic neurons (approximately 10 to 20%). GABAergic interneurons play
a critical role within cortical networks although they are a minority. Interneuron phenotypes are diverse, with more
than 20 distinct inhibitory interneuron cell types in rodents. They can be divided into several subtypes according to
their morphology, expression of histochemical markers, molecular functions, electrophysiology, and synaptic
properties.
In this chapter, we will apply quantum kernel methods to the quantitative characterization of neuronal morphologies
from histological neuronal reconstructions, which represent a primary resource to address anatomical comparisons and
morphometric analysis. Morphology-based classification of neuron types at the whole-brain level in the rat remains a
challenge, as it is not clear how to designate a neuronal cell given the significant number of neuron types, the limited
samples (reconstructed neuron), the best features by which to define them, and diverse data formats such as two- and
three-dimensional images (structured with high dimensions and fewer samples than the complexity of morphologies)
and SWC-format files (low-dimensional and unstructured). There are several reasons why neuroscientists are interested
in this topic. Some brain diseases affect specific cell types, and current knowledge may be improved by correlating some
disorders to the underlying vulnerable neuronal types. Neuron morphology studies can lead to the identification of genes to
target for specific cell morphology and functions linked to them. The discovery of gene functions in specific cell types can be
used as entry points. Before acquiring its form and function, a neuron goes through different stages of development that
need to be understood to identify new markers, marker combinations, or mediators of developmental choices. The under-
standing of neuron morphologies often represents the basis of modeling efforts and data-driven modeling approaches to
study the impact of a cell’s morphology on its electrical behavior and on the network in which it is embedded. Classification
by neuron subtypes represents a way to reduce dimensionality for modeling purposes.
Even if its principle is almost the same as the classical kernel method, the quantum kernel method is based on quantum
computing properties and maps a data point from an original space to a quantum Hilbert space. The quantum mapping
function is critically important in the process of quantum kernel methods, with a direct impact on the model's performance.
Finding a suitable feature map in the context of gate-based quantum computers is less trivial than simply specifying a
suitable kernel in classical algorithms. In quantum kernel machine learning, a classical feature vector x is mapped to a quantum Hilbert space using a quantum feature map Φ(x) such that $K_{ij} = \left| \Phi^\dagger(x_j)\,\Phi(x_i) \right|^2$. There are important
factors to evaluate when considering a feature map such as the feature map circuit depth, the data map function for
encoding the classical data, the quantum gate set, and the order expansion. We can find different types of feature maps
such as ZFeatureMap, which implements a first-order diagonal expansion where | S | = 1. We need to establish different
parameters such as the feature dimensions (dimensionality of the data, which is equal to the number of required qubits), the
number of times the feature map circuit is repeated (reps), and a function encoding the classical data. Here, there is no
entanglement because there are no interactions between features. Another example is the ZZFeatureMap, which is a sec-
ond-order Pauli-Z evolution circuit allowing | S | ≤ 2 and Φ, a classical nonlinear function. Here, interactions in the data are
encoded in the feature map according to the connectivity graph and the classical data map. Like ZFeatureMap, ZZFeature-
Map needs the same parameters as well as an additional one, which is the entanglement that generates connectivity (“full,”
“linear,” or its own entanglement structure). The PauliFeatureMap is the general form of the feature map that allows us to
create feature maps using different gates. It transforms input data x ∈ R^n such that the following is true:
$$U_{\Phi(x)} = \exp\left( i \sum_{S \subseteq [n]} \phi_S(x) \prod_{i \in S} P_i \right)$$
where $P_i \in \{I, X, Y, Z\}$ denotes the Pauli matrices and S denotes the connectivities between different qubits or data points, $S \in \left\{ \binom{n}{k} \text{ combinations}, \; k = 1, \dots, n \right\}$. For k = 1 and $P_0 = Z$, we recover the ZFeatureMap, and the ZZFeatureMap for k = 2 with $P_0 = Z$ and $P_{0,1} = ZZ$.
In this chapter, we test eight quantum kernel algorithms. The first one, named q_kernel_zz, applies a ZZFeatureMap.
As described by Havlíček et al. (2019), we define a feature map on n-qubits generated by the unitary:
$$\mathcal{U}_{\Phi(x)} = U_{\Phi(x)}\, H^{\otimes n}\, U_{\Phi(x)}\, H^{\otimes n}$$
where H denotes the conventional Hadamard gate and $U_{\Phi(x)}$ is a diagonal gate in the Pauli-Z basis:
$$U_{\Phi(x)} = \exp\left( i \sum_{S \subseteq [n]} \phi_S(x) \prod_{i \in S} Z_i \right)$$
The encoding function that transforms the input data into a higher-dimensional feature space is given by the following:
$$\Phi(x) = \left( \phi_1(x), \phi_2(x), \phi_{1,2}(x) \right)$$
To create a feature map and test different encoding functions, we use the encoding function from Havlíček et al. (2019):
• q_kernel_zz: $\phi_i(x) = x_i$ and $\phi_{1,2}(x) = (\pi - x_1)(\pi - x_2)$
• q_kernel_9: $\phi_i(x) = x_i$ and $\phi_{1,2}(x) = \frac{\pi}{2}(1 - x_1)(1 - x_2)$
• q_kernel_10: $\phi_i(x) = x_i$ and $\phi_{1,2}(x) = \exp\left(\frac{|x_1 - x_2|^2}{8 \ln \pi}\right)$
• q_kernel_11: $\phi_i(x) = x_i$ and $\phi_{1,2}(x) = \frac{\pi}{3\cos(x_1)\cos(x_2)}$
• q_kernel_12: $\phi_i(x) = x_i$ and $\phi_{1,2}(x) = \pi \cos(x_1)\cos(x_2)$
For q_kernel_8, q_kernel_9, q_kernel_10, q_kernel_11, and q_kernel_12, we use the PauliFeatureMap
(paulis = [‘ZI’,‘IZ’,‘ZZ’]).
It is also possible to train a quantum kernel with quantum kernel alignment (QKA), which iteratively adapts a
parameterized quantum kernel to a dataset and converges to the maximum SVM margin at the same time (we have named
it q_kernel_training). The algorithm introduced by Glick et al. (2021) allows, from a family of kernels (covariant quantum
kernels that are related to covariant quantum measurements), the learning of a quantum kernel; at the same time,
converging to the maximum SVM margin optimizes the parameters in a quantum circuit. To implement it, we prepare
the dataset as usual and define the quantum feature map. We then use the QuantumKernelTrainer.fit method to train the kernel parameters and pass the trained kernel to a machine learning model. In covariant quantum kernels, the feature map is defined by a unitary representation D(x) for x ∈ χ and a state $|\psi\rangle = U|0^n\rangle$. The kernel matrix is given as follows:
$$K(x, x') = \left| \langle 0^n | U^\dagger D^\dagger(x) D(x') U | 0^n \rangle \right|^2$$
For a given group, the quantum kernel alignment is used to find the optimal fiducial state. In the context of covariant
quantum kernels, the equation can be extended to the following:
$$K_\lambda(x, x') = \left| \langle 0^n | U_\lambda^\dagger D^\dagger(x) D(x') U_\lambda | 0^n \rangle \right|^2$$
where quantum kernel alignment will learn an optimal fiducial state parametrized by λ for a given group.
In order to train the quantum kernel, we will employ an instance of TrainableFidelityQuantumKernel, encompassing
both the feature map and its associated parameters. Following the training of the quantum kernels, we will proceed to utilize
Qiskit’s Quantum Variational Classifier (QVC) for classification tasks (available at https://qiskit.org). The TrainableFide-
lityQuantumKernel and QuantumKernelTrainer will be employed for managing the training process within the context of
q_kernel_training. Specifically, the Quantum Kernel Alignment technique will be employed during training, with the
selection of the kernel loss function SVCLoss as the input to QuantumKernelTrainer.
Given that SVCLoss is a supported loss function in Qiskit, we can utilize the string representation “svc_loss”; however,
it is important to note that default settings will be employed when passing the loss as a string. Should custom settings be
desired, it is necessary to explicitly instantiate the desired options and pass the KernelLoss object to the
QuantumKernelTrainer.
To optimize the training process, the SPSA optimizer will be selected, and the trainable parameters will be initialized
using the initial_point argument. It is worth mentioning that the length of the list provided as the initial_point argument
must be equal to the number of trainable parameters present in the feature map.
Let us start coding and first import the necessary libraries.
Input:
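# A possible set of imports for this section (assumed module paths; they can differ between Qiskit versions)
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report
from qiskit.circuit.library import PauliFeatureMap, ZFeatureMap, ZZFeatureMap
from qiskit.primitives import Sampler
from qiskit.algorithms.state_fidelities import ComputeUncompute
from qiskit.algorithms.optimizers import SPSA
from qiskit_machine_learning.algorithms import QSVC
from qiskit_machine_learning.kernels import FidelityQuantumKernel, TrainableFidelityQuantumKernel
from qiskit_machine_learning.kernels.algorithms import QuantumKernelTrainer
from qiskit_machine_learning.utils.loss_functions import SVCLoss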
The computational execution can be tailored to utilize either local simulators (Aer simulators) or online quantum simu-
lators and real quantum hardware. The following lines can be commented or uncommented as per the desired choice. If the
intention is to run the code on a local simulator, the following lines can be added.
Input:
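## Compute code with a local simulator
# A minimal sketch (assumption): use the reference Sampler primitive from qiskit.primitives
from qiskit.primitives import Sampler
sampler = Sampler()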
Alternatively, if our preference is to execute the code using online quantum simulators or real quantum hardware, the
following lines can be added:
Input:
## Compute code with online quantum simulators or quantum hardware from the cloud
# Import QiskitRuntimeService and Sampler
from qiskit_ibm_runtime import QiskitRuntimeService, Sampler
# Define service
service = QiskitRuntimeService(channel = 'ibm_quantum', token = " YOUR TOKEN ",
instance = 'ibm-q/open/main')
# Get backend
quantum_backend = "ibmq_bogota"
backend = service.backend(quantum_backend) # Use a simulator or hardware from the
cloud
# Define Sampler with different options
# resilience_level=1 adds readout error mitigation
# execution.shots is the number of shots
# optimization_level=3 adds dynamical decoupling
from qiskit_ibm_runtime import Options
options = Options()
options.resilience_level = 1
options.execution.shots = 1024
options.optimization_level = 3
sampler = Sampler(session=backend, options = options)
To utilize this final option, it is essential to possess an IBM Quantum account, which can be obtained through the fol-
lowing link: https://quantum-computing.ibm.com. Signing in to the account will grant access to a personal token that can
be directly incorporated into our code. Additionally, by visiting https://quantum-computing.ibm.com, we gain visibility into
all the available online simulators and hardware systems.
We also need to set a seed for randomization to keep outputs consistent.
Input:
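# Set a seed for randomization so that outputs stay consistent (the value itself is arbitrary)
from qiskit.utils import algorithm_globals
seed = 123
algorithm_globals.random_seed = seed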
# Encoding Functions
from functools import reduce
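import numpy as np

# The data-map functions below are reconstructions (assumptions) of the encoding functions
# phi_{1,2}(x) listed earlier in this section; each callable receives the features associated
# with a Pauli term (one or two values) and returns the rotation coefficient, following the
# convention of PauliFeatureMap's data_map_func argument.

def data_map_8(x):
    # Assumed definition (not given explicitly in the text): phi_i(x) = x_i, phi_{1,2}(x) = pi * x_1 * x_2
    return x[0] if len(x) == 1 else np.pi * reduce(lambda m, n: m * n, x)

def data_map_9(x):
    # phi_i(x) = x_i, phi_{1,2}(x) = (pi/2) * (1 - x_1) * (1 - x_2)
    return x[0] if len(x) == 1 else (np.pi / 2) * reduce(lambda m, n: m * n, 1 - x)

def data_map_10(x):
    # phi_i(x) = x_i, phi_{1,2}(x) = exp(|x_1 - x_2|^2 / (8 ln pi))
    return x[0] if len(x) == 1 else np.exp(np.abs(x[0] - x[1]) ** 2 / (8 * np.log(np.pi)))

def data_map_11(x):
    # phi_i(x) = x_i, phi_{1,2}(x) = pi / (3 cos(x_1) cos(x_2))
    return x[0] if len(x) == 1 else np.pi / (3 * reduce(lambda m, n: m * n, np.cos(x)))

def data_map_12(x):
    # phi_i(x) = x_i, phi_{1,2}(x) = pi * cos(x_1) * cos(x_2)
    return x[0] if len(x) == 1 else np.pi * reduce(lambda m, n: m * n, np.cos(x))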
qfm_default = PauliFeatureMap(feature_dimension=2,
paulis = ['ZI','IZ','ZZ'],
reps=2, entanglement='full')
print(qfm_default)
qfm_8 = PauliFeatureMap(feature_dimension=2,
paulis = ['ZI','IZ','ZZ'],
reps=2, entanglement='full', data_map_func=data_map_8)
print(qfm_8)
qfm_9 = PauliFeatureMap(feature_dimension=2,
paulis = ['ZI','IZ','ZZ'],
reps=2, entanglement='full', data_map_func=data_map_9)
print(qfm_9)
qfm_10 = PauliFeatureMap(feature_dimension=2,
paulis = ['ZI','IZ','ZZ'],
reps=2, entanglement='full', data_map_func=data_map_10)
print(qfm_10)
qfm_11 = PauliFeatureMap(feature_dimension=2,
paulis = ['ZI','IZ','ZZ'],
reps=2, entanglement='full', data_map_func=data_map_11)
print(qfm_11)
qfm_12 = PauliFeatureMap(feature_dimension=2,
paulis = ['ZI','IZ','ZZ'],
reps=2, entanglement='full', data_map_func=data_map_12)
print(qfm_12)
Output:
(The print statements display six two-qubit circuits, each drawn as PauliFeatureMap(x[0], x[1]) acting on qubits q_0 and q_1, one for each data map.)
The code makes use of the default implementation of the Sampler primitive and employs the ComputeUncompute fidelity
measure to calculate the overlaps between states. In the event that specific instances of Sampler or Fidelity are not provided,
the code will automatically generate these objects with the default values.
Input:
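# A sketch (assumption) of how the quantum kernels used below can be instantiated: a fidelity
# built on the default Sampler primitive is shared by one FidelityQuantumKernel per feature map
from qiskit.primitives import Sampler
from qiskit.algorithms.state_fidelities import ComputeUncompute
from qiskit_machine_learning.kernels import FidelityQuantumKernel

fidelity = ComputeUncompute(sampler=Sampler())

Q_Kernel_default = FidelityQuantumKernel(fidelity=fidelity, feature_map=qfm_default)
Q_Kernel_8 = FidelityQuantumKernel(fidelity=fidelity, feature_map=qfm_8)
Q_Kernel_9 = FidelityQuantumKernel(fidelity=fidelity, feature_map=qfm_9)
Q_Kernel_10 = FidelityQuantumKernel(fidelity=fidelity, feature_map=qfm_10)
Q_Kernel_11 = FidelityQuantumKernel(fidelity=fidelity, feature_map=qfm_11)
Q_Kernel_12 = FidelityQuantumKernel(fidelity=fidelity, feature_map=qfm_12)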
We process our data as we have done previously in classical computing (load data, process missing data, split data, nor-
malize, use PCA):
Input:
# Import dataset
data = '../data/datasets/neurons_binary.csv'
df = pd.read_csv(data, delimiter=';')
# Divide the data, y the variable to predict (Target) and X the features
X = df[df.columns[1:]]
y = df['Target']
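# A sketch of the remaining preprocessing steps mentioned above (assumed choices):
# drop rows with missing values, split 80/20, rescale, and reduce to two features
# to match the feature_dimension=2 of the feature maps defined earlier
X = X.dropna()
y = y.loc[X.index]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=seed)
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
pca = PCA(n_components=2)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)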
classifiers = [
QSVC(quantum_kernel=Q_Kernel_default),
QSVC(quantum_kernel=Q_Kernel_8),
QSVC(quantum_kernel=Q_Kernel_9),
QSVC(quantum_kernel=Q_Kernel_10),
QSVC(quantum_kernel=Q_Kernel_11),
QSVC(quantum_kernel=Q_Kernel_12),
]
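# A sketch (assumption) of the training loop: fit each quantum-kernel classifier and report
# its accuracy on the held-out test set together with cross-validation scores
for classifier in classifiers:
    qsvc = classifier.fit(X_train, y_train)
    y_pred = qsvc.predict(X_test)
    print("Accuracy:", accuracy_score(y_test, y_pred))
    scores = cross_val_score(qsvc, X_train, y_train, cv=5)
    print("Cross Validation Mean:", scores.mean())
    print("Cross Validation Std:", scores.std())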
Output:
As we have done previously in classical computing, we can provide metrics about our model (accuracy, precision, recall, f1
score, cross-validation, classification report).
Input:
print("\n")
print("Print predicted data coming from X_test as new input data")
print(y_pred)
print("\n")
print("Print real values\n")
print(y_test)
print("\n")
Output:
Q_Kernel_default
9 0
30 3
19 1
35 3
0 0
21 2
3 0
29 2
Name: Target, dtype: int64
Accuracy: 0.125
Precision: 0.125
Recall: 0.125
f1 Score: 0.125
Cross Validation Mean: 0.11904761904761904
Cross Validation Std: 0.10858813572372743
Q_Kernel_8
9 0
30 3
19 1
35 3
0 0
21 2
3 0
29 2
Name: Target, dtype: int64
Accuracy: 0.125
Precision: 0.125
Recall: 0.125
f1 Score: 0.125
Cross Validation Mean: 0.21904761904761902
Cross Validation Std: 0.07589227357385346
Q_Kernel_9
9 0
30 3
19 1
35 3
0 0
21 2
3 0
29 2
Name: Target, dtype: int64
Accuracy: 0.125
Precision: 0.125
Recall: 0.125
f1 Score: 0.125
Cross Validation Mean: 0.2523809523809524
Cross Validation Std: 0.08192690730516787
Q_Kernel_10
9 0
30 3
19 1
35 3
0 0
21 2
3 0
29 2
Name: Target, dtype: int64
Accuracy: 0.125
Precision: 0.125
Recall: 0.125
f1 Score: 0.125
Cross Validation Mean: 0.1571428571428571
Cross Validation Std: 0.09110060223670947
Q_Kernel_11
9 0
30 3
19 1
35 3
0 0
21 2
3 0
29 2
Name: Target, dtype: int64
Accuracy: 0.25
Precision: 0.25
Recall: 0.25
f1 Score: 0.25
Cross Validation Mean: 0.28095238095238095
Cross Validation Std: 0.1846735183777649
Q_Kernel_12
9 0
30 3
19 1
35 3
0 0
21 2
3 0
29 2
Name: Target, dtype: int64
Accuracy: 0.25
Precision: 0.25
Recall: 0.25
f1 Score: 0.25
Cross Validation Mean: 0.22380952380952376
Cross Validation Std: 0.08984743935292004
Classification Report:
precision recall f1-score support
accuracy 0.25 8
macro avg 0.13 0.38 0.18 8
weighted avg 0.11 0.25 0.14 8
Another coding example with q_kernel_zz is described below. We will use the ZZFeatureMap with linear entangle-
ment, we will repeat the data encoding step two times, and we will use feature selection (embedded decision tree) to select
five features (on a very small dataset of 260 neurons). We will use five qubits on the StatevectorSimulator from the IBM
Quantum framework (https://qiskit.org/). The simulator models the noiseless execution of quantum computer hardware,
representing the ideal. It evaluates the resulting quantum state vector. For each experiment, to assess whether patterns
are identifiable, we will train the supervised classification algorithms using 80% of each sample, randomly chosen, and we
will assess the accuracy in predicting the remaining 20%. The accuracy of the classification will be assessed by performing
cross-validation on the training dataset. We will average the fivefold cross-validation scores resulting in a mean ± standard
deviation score. For data rescaling, we will use QuantileTransformer. We will also transform the dataset using a Maha-
lanobis transformation with suppression of a neuron when the surface of the soma equals 0. The Mahalanobis distance is
a multivariate metric measuring the distance between a point and a distribution. Applying the Mahalanobis distance
allows reduction of the standard deviation for each feature by deleting neurons from the dataset. The datasets will be
preprocessed to address missing values. If a value within the features is missing, the neuron will be deleted from the
dataset. Categorical features such as morphology types will be encoded, transforming each categorical feature with m possible values into m binary variables (one-hot encoding).
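As a side note, the Mahalanobis distance of a point x to a distribution with mean μ and covariance S is √((x − μ)ᵀ S⁻¹ (x − μ)); a small, self-contained illustration (not part of the book's pipeline) is the following:
import numpy as np

def mahalanobis_distances(X):
    # Distance of each row of X to the distribution of X itself
    mu = X.mean(axis=0)                                # feature means
    cov_inv = np.linalg.pinv(np.cov(X, rowvar=False))  # (pseudo-)inverse covariance
    diff = X - mu
    return np.sqrt(np.einsum('ij,jk,ik->i', diff, cov_inv, diff))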
Input:
# Import utilities
import numpy as np
import pandas as pd
# Define parameters
cv = 5 # 5-fold cross-validation
feature_dimension = 5 # Features dimension
k_features = 5 # Feature selection
reps = 2 # Repetition
ibm_account = 'YOUR TOKEN'
quantum_backend = 'simulator_statevector'
# Import dataset
data = '../data/datasets/neurons_maha_soma.csv'
neuron = pd.read_csv(data, delimiter=',')
print(neuron)
df = neuron.head(22).copy() # Ganglion
df = pd.concat([df, neuron.iloc[320:340]]) # Granule
Inputs:
- X (features) DataFrame
- y (target) DataFrame
'''
print("\n")
print("Decision Tree Regressor Features Importance: started")
print("\n")
print("\n")
print("Decision Tree Classifier Features Importance: DataFrame")
print("\n")
print(df_data)
return df_data
y_train = y_train.values.ravel()
y_test = y_test.values.ravel()
print(qfm_zz)
# QSVC model
model = QSVC(quantum_kernel=Q_Kernel_zz)
model.fit(X_train,y_train)
score = model.score(X_test, y_test)
print(f'Callable kernel classification test score for q_kernel_zz: {score}')
y_pred = model.predict(X_test)
print("\n")
print("Print predicted data coming from X_test as new input data")
print(y_pred)
print("\n")
print("Print real values\n")
print(y_test)
print("\n")
# score classifier
score[i] = model.score(X_test_, y_test_)
i = i + 1
import math
print("cross validation scores: ", score)
cross_mean = sum(score) / len(score)
cross_var = sum(pow(x - cross_mean,2) for x in score) / len(score) # variance
cross_std = math.sqrt(cross_var) # standard deviation
print("cross validation mean: ", cross_mean)
print(metrics_dataframe)
Output:
Target Soma_Surface N_stems N_bifs N_branch N_tips \
0 ganglion 1149.320 4.0 101.0 206.0 106.0
1 ganglion 1511.830 3.0 70.0 143.0 74.0
2 ganglion 1831.530 3.0 13.0 29.0 17.0
3 ganglion 1291.270 6.0 109.0 224.0 116.0
4 ganglion 3064.340 4.0 60.0 124.0 65.0
... ... ... ... ... ... ...
22686 double_bouquet 605.067 5.0 132.0 269.0 138.0
22687 double_bouquet 920.949 6.0 121.0 248.0 128.0
22688 double_bouquet 770.529 3.0 104.0 211.0 108.0
22689 double_bouquet 478.078 4.0 158.0 320.0 163.0
22690 double_bouquet 629.470 4.0 65.0 134.0 70.0
Features Importances:
Features Importances
0 Soma_Surface 0.132398
1 N_stems 0.019728
2 N_bifs 0.000000
3 N_branch 0.000000
4 N_tips 0.000000
5 Width 0.000000
6 Height 0.174589
7 Depth 0.070581
8 Type 0.115796
9 Diameter 0.000000
10 Diameter_pow 0.000000
11 Length 0.000000
12 Surface 0.000000
13 SectionArea 0.022928
14 Volume 0.000000
15 EucDistance 0.000000
16 PathDistance 0.095161
17 Branch_Order 0.000000
18 Terminal_degree 0.000000
19 TerminalSegment 0.009220
20 Taper_1 0.076445
21 Taper_2 0.000000
22 Branch_pathlength 0.044744
23 Contraction 0.000000
24 Fragmentation 0.000000
25 Daughter_Ratio 0.024061
26 Parent_Daughter_Ratio 0.000000
27 Partition_asymmetry 0.000000
28 Rall_Power 0.014306
29 Pk 0.008298
30 Pk_classic 0.000000
31 Pk_2 0.000000
32 Bif_ampl_local 0.066446
33 Bif_ampl_remote 0.027766
34 Bif_tilt_local 0.000000
35 Bif_tilt_remote 0.000000
36 Bif_torque_local 0.000000
37 Bif_torque_remote 0.000000
38 Last_parent_diam 0.016321
39 Diam_threshold 0.020529
40 HillmanThreshold 0.045931
41 Helix 0.014752
42 Fractal_Dim 0.000000
(Output: the printed five-qubit feature-map circuit acting on qubits q_0 through q_4; the diagram is not reproduced here.)
ibmq_qasm_simulator
ibmq_lima
ibmq_belem
ibmq_quito
simulator_statevector
simulator_mps
simulator_extended_stabilizer
simulator_stabilizer
ibmq_manila
ibm_nairobi
ibm_oslo
Callable kernel classification test score for q_kernel_zz: 0.4716981132075472
[ 9 10 7 4 11 2 4 9 5 1 6 0 8 10 7 9 11 5 10 11 2 3 5 10
1 3 11 11 12 5 6 7 0 1 6 7 11 0 10 5 9 8 1 11 6 3 0 5
12 2 9 4 1]
[0. 0. 0. 0. 0.]
cross validation scores: [0.38095238 0.5 0.57142857 0.42857143 0.58536585]
cross validation mean: 0.4932636469221835
Classification Report:
precision recall f1-score support
accuracy 0.47 53
macro avg 0.49 0.43 0.44 53
weighted avg 0.56 0.47 0.50 53
q_kernel_zz
Accuracy 0.471698
Precision 0.471698
Recall 0.471698
F1 Score 0.471698
Cross-validation mean 0.493264
Cross-validation std 0.079293
As we can see, the results improved. If we now apply the same algorithm by varying the size of the dataset, the cross-
validation score will improve (Table 5.2).
When applying the q_kernel_zz on the entire dataset, we have the following results:
(Figure: mean cross-validation score of q_kernel_zz (5 qubits, five selected features, quantile-uniform rescaling, decision-tree feature selection) for the classification of neuron morphologies on Samples 1 through 5; the score increases with the size of the dataset, from roughly 0.61 on the smallest samples to about 0.93 on Sample 5.)
Table 5.2 Dataset with the number of neuron morphologies for multiclass classification. From the 27,881 extracted neurons, 22,691 neurons (Sample 5) remained after the application of Mahalanobis distance transformation and suppression of all neurons with a soma surface equal to 0.

Cell type          Sample 1   Sample 2   Sample 3   Sample 4   Sample 5
Principal cells
  Ganglion               20         50        100        200        318
  Granule                20         50        100        200        851
  Medium Spiny           20         50        100        200        762
  Parachromaffin         20         50        100        200        322
  Pyramidal              20         50        100        200     12,558
Interneurons
  Basket                 20         50        100        200        470
  Bitufted               20         50         67         67         55
  Chandelier             20         26         26         26         24
  Double bouquet         20         50         50         50         49
  Martinotti             20         50        100        137        107
  Nitrergic              20         50        100        200       1771
Glial cells
  Astrocytes             20         50        100        200        450
  Microglia              20         50        100        200       4954
Total                   260        626       1143       2080      22691
It is also possible to train a quantum kernel with quantum kernel alignment (QKA), which iteratively adapts a parameterized quantum kernel to a dataset while converging to the maximum SVM margin at the same time. To implement it, we prepare the dataset as usual and define the parameterized quantum feature map. Then, we use the QuantumKernelTrainer.fit method to train the kernel parameters and pass the optimized kernel to a machine learning model.
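A condensed, hedged sketch of the key step (assuming qiskit-machine-learning's TrainableFidelityQuantumKernel and QuantumKernelTrainer; fm denotes the composed feature map and training_params the trainable parameters shown in the code below):
from qiskit.algorithms.optimizers import SPSA
from qiskit_machine_learning.kernels import TrainableFidelityQuantumKernel
from qiskit_machine_learning.kernels.algorithms import QuantumKernelTrainer
from qiskit_machine_learning.algorithms import QSVC

quant_kernel = TrainableFidelityQuantumKernel(feature_map=fm, training_parameters=training_params)
qkt = QuantumKernelTrainer(quantum_kernel=quant_kernel, loss="svc_loss",
                           optimizer=SPSA(maxiter=10), initial_point=[0.1])
qka_results = qkt.fit(X_train, y_train)               # aligns the kernel to the training data
model = QSVC(quantum_kernel=qka_results.quantum_kernel)
model.fit(X_train, y_train)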
Input:
# Import utilities
import numpy as np
import pandas as pd
# Import dataset
data = '../data/datasets/neurons.csv'
neuron = pd.read_csv(data, delimiter=';')
df = neuron  # the full script may subset this DataFrame; here we work on it directly
# Divide the data, y the variable to predict (Target) and X the features
X = df[df.columns[1:]]
y = df['Target']
def get_callback_data(self):
    return self._data

def clear_callback_data(self):
    self._data = [[] for i in range(5)]
# Qiskit imports
from qiskit import QuantumCircuit
from qiskit.circuit import ParameterVector
# Create a rotational layer to train. We will rotate each qubit the same amount.
training_params = ParameterVector("θ", 1)
fm0 = QuantumCircuit(feature_dimension)
for qubit in range(feature_dimension):
fm0.ry(training_params[0], qubit)
print(circuit_drawer(fm))
print(f"Trainable parameters: {training_params}")
metrics_dataframe
Output:
(Output: the first circuit shows the trainable Ry(θ[0]) rotation layer on q_0 and q_1 composed with the two-qubit ZZFeatureMap(x[0], x[1]); the second shows the plain ZZFeatureMap. The diagrams are not reproduced here.)
377 2
17034 0
381 2
17051 0
17052 0
374 2
9 1
17041 0
388 2
378 2
382 2
384 2
Name: Target, dtype: int64
Classification Report:
precision recall f1-score support
accuracy 0.42 12
macro avg 0.43 0.56 0.39 12
weighted avg 0.57 0.42 0.44 12
5.4 Pegasos QSVC: Binary Classification
There is an alternative method to QSVC (which uses the dual optimization from scikit-learn): the Pegasos algorithm from Shalev-Shwartz, another SVM-based algorithm that benefits from the quantum kernel method. PegasosQSVC has a training complexity that is independent of the size of the training set, meaning that it can train faster than QSVC with large training sets. We need to optimize some hyperparameters, notably the regularization parameter C and the number of steps tau.
Input:
# Import utilities
import numpy as np
import pandas as pd
# regularization parameter
C = 1000
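# Number of steps for the Pegasos training procedure (used below as num_steps=tau);
# the value here is illustrative and assumed to be set in the full script
tau = 100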
# Encoding Functions
from functools import reduce
qfm_default = PauliFeatureMap(feature_dimension=2,
paulis = ['ZI','IZ','ZZ'],
reps=2, entanglement='full')
print(qfm_default)
qfm_8 = PauliFeatureMap(feature_dimension=2,
paulis = ['ZI','IZ','ZZ'],
reps=2, entanglement='full', data_map_func=data_map_8)
print(qfm_8)
qfm_9 = PauliFeatureMap(feature_dimension=2,
paulis = ['ZI','IZ','ZZ'],
reps=2, entanglement='full', data_map_func=data_map_9)
print(qfm_9)
qfm_10 = PauliFeatureMap(feature_dimension=2,
paulis = ['ZI','IZ','ZZ'],
reps=2, entanglement='full', data_map_func=data_map_10)
print(qfm_10)
qfm_11 = PauliFeatureMap(feature_dimension=2,
paulis = ['ZI','IZ','ZZ'],
reps=2, entanglement='full', data_map_func=data_map_11)
print(qfm_11)
qfm_12 = PauliFeatureMap(feature_dimension=2,
paulis = ['ZI','IZ','ZZ'],
reps=2, entanglement='full', data_map_func=data_map_12)
print(qfm_12)
print(qfm_8.draw())
classifiers = [
PegasosQSVC(quantum_kernel=Q_Kernel_default, C=C, num_steps=tau),
PegasosQSVC(quantum_kernel=Q_Kernel_8, C=C, num_steps=tau),
PegasosQSVC(quantum_kernel=Q_Kernel_9, C=C, num_steps=tau),
PegasosQSVC(quantum_kernel=Q_Kernel_10, C=C, num_steps=tau),
PegasosQSVC(quantum_kernel=Q_Kernel_11, C=C, num_steps=tau),
PegasosQSVC(quantum_kernel=Q_Kernel_12, C=C, num_steps=tau),
]
# Import dataset
data = '../data/datasets/neurons_test.csv'
df = pd.read_csv(data, delimiter=';')
# Divide the data, y the variable to predict (Target) and X the features
X = df[df.columns[1:]]
y = df['Target']
print(X_train)
print(X_test)
print("\n")
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print("\n")
print("Print predicted data coming from X_test as new input data")
print(y_pred)
print("\n")
print("Print real values\n")
print(y_test)
print("\n")
print('Metrics:')
print(metrics_dataframe)
5.5 Quantum Neural Networks
Within the realm of quantum computing, machine learning models such as QNNs have emerged as a subclass of variational
quantum algorithms. Feature maps, which find applications in QNN architectures, aim to harness quantum principles such
as superposition, entanglement, and interference to enhance accuracy, expedite training, and accelerate model processing.
While the debate between classical and quantum methods continues, some research papers have demonstrated theoretical
advantages of future quantum systems.
In the context of Qiskit Machine Learning, the NeuralNetworkClassifier and NeuralNetworkRegressor modules are employed. These modules accept a (quantum) NeuralNetwork as input and utilize it within specific contexts. For convenience, pre-configured variants, namely the Variational Quantum Classifier (VQC) and the Variational Quantum Regressor (VQR), are also available.
• VQC
In the case of an EstimatorQNN used for classification within a NeuralNetworkClassifier, the EstimatorQNN is expected
to produce one-dimensional output within the range of [−1, +1]. This setup is suitable for binary classification, where the
two classes are assigned the values of −1 and +1, respectively.
Alternatively, a SamplerQNN employed for classification within a NeuralNetworkClassifier expects a d-dimensional
probability vector as output, where d represents the number of classes. The underlying Sampler primitive generates
quasi-distributions of bit strings, and a mapping from the measured bit strings to the different classes needs to be defined.
For binary classification, a parity mapping is utilized.
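A minimal sketch of such a parity mapping, written as an interpret function that could be passed to a SamplerQNN (illustrative, not the book's exact code):
def parity(x):
    # x is the measured bit string expressed as an integer; its parity (number of ones
    # modulo 2) is used as the class index, yielding the two classes 0 and 1
    return bin(x).count("1") % 2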
The VQC serves as a specialized variant of the NeuralNetworkClassifier, employing a SamplerQNN. It applies a parity
mapping (or extensions for multiple classes) to convert the bit string to the corresponding classification, resulting in a
probability vector interpreted as a one-hot encoded result. By default, the VQC utilizes the CrossEntropyLoss function,
which expects labels in a one-hot encoded format and returns predictions in the same format.
• Divide the data, with y the variable to predict (target) and X the features.
• Split the data into training (X_train, y_train) and test (30%) sets.
We then create a quantum instance and build our QNN, create the neural network classifier, fit the classifier to our data,
and give the score and prediction using our testing dataset.
In addition, we create a callback function that will be called for each iteration of the optimizer. It needs two parameters,
which are the current weights and the value of the objective function at those weights.
Input:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Import dataset
data = '../data/datasets/neurons_binary.csv'
neuron = pd.read_csv(data, delimiter=';')
df = neuron  # work on the full dataset (df is used below)
# Divide the data, y the variable to predict (Target) and X the features
X = df[df.columns[1:]].to_numpy()  # We remove labels and convert the pandas DataFrame into a NumPy array
y = df['Target'].replace(0, -1).to_numpy()  # We replace our labels by 1 and -1 and convert the pandas DataFrame into a NumPy array
# Transform data
X_train = Normalize.fit_transform(X_train)
X_test = Normalize.fit_transform(X_test)
# Transform data
X_train = pca.fit_transform(X_train)
X_test = pca.fit_transform(X_test)
# Variable definition
feature_dimension = 2 # Number of qubits
quantum_backend = None # We use local simulator
reps = 2 # Number of repetitions
# Get backend and create the Estimator primitive
if quantum_backend is not None:
    backend = service.backend(quantum_backend)  # Use a simulator or hardware from the cloud
    # Configure the Qiskit Runtime Estimator primitive with error mitigation options
    from qiskit_ibm_runtime import Options, Estimator
    options = Options()
    options.resilience_level = 1
    options.execution.shots = 1024
    options.optimization_level = 3
    estimator = Estimator(session=backend, options=options)
else:
    # We use the reference implementation of the Estimator primitive (local simulation).
    from qiskit.primitives import Estimator
    estimator = Estimator()
# construct ansatz
from qiskit.circuit.library import RealAmplitudes
ansatz = RealAmplitudes(feature_dimension, reps=reps)
# Build QNN
from qiskit_machine_learning.neural_networks import EstimatorQNN
estimator_qnn = EstimatorQNN(
circuit=qc,
input_params=feature_map.parameters,
weight_params=ansatz.parameters,
estimator = estimator
)
# create empty array for callback to store evaluations of the objective function
objective_func_vals = []
# callback function that draws a live plot when the .fit() method is called
from IPython.display import clear_output
def callback_graph(weights, obj_func_eval):
    clear_output(wait=True)
    objective_func_vals.append(obj_func_eval)
    plt.title("Objective function value against iteration")
    plt.xlabel("Iteration")
    plt.ylabel("Objective function value")
    plt.plot(range(len(objective_func_vals)), objective_func_vals)
    plt.show()
# score classifier
estimator_classifier.score(X_train, y_train)
# score classifier
estimator_classifier.score(X_test, y_test)
# plot results
# red == wrongly classified
for x, y_target, y_p in zip(X_train, y_train, y_predict):
    if y_target == 1:
        plt.plot(x[0], x[1], "bo")
    else:
        plt.plot(x[0], x[1], "go")
    if y_target != y_p:
        plt.scatter(x[0], x[1], s=200, facecolors="none", edgecolors="r", linewidths=2)
plt.plot([-1, 1], [1, -1], "--", color="black")
plt.show()
print(estimator_classifier.score(X_test, y_test))
metrics_dataframe
Output:
(Figure: the objective function value plotted against iteration over roughly 700 iterations of the optimizer, followed by the scatter plot of the training points, where wrongly classified points are circled in red.)
[ 1.]
[-1.]
[-1.]
[-1.]
[ 1.]
[-1.]
[ 1.]
[ 1.]]
[-1 1 1 -1 1 -1 1 -1 -1 1 -1 1 1]
Classification Report:
precision recall f1-score support
accuracy 0.77 13
macro avg 0.77 0.77 0.77 13
weighted avg 0.78 0.77 0.77 13
opflow_classifier
Input:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Import dataset
data = '../data/datasets/neurons_binary.csv'
neuron = pd.read_csv(data, delimiter=';')
df = neuron  # work on the full dataset (df is used below)
# Divide the data, y the variable to predict (Target) and X the features
X = df[df.columns[1:]].to_numpy()  # We remove labels and convert the pandas DataFrame into a NumPy array
y = df['Target'].to_numpy()
# Variable definition
number_classes = 2 # Number of classes
feature_dimension = 2 # Number of qubits
quantum_backend = None # We use local simulator
reps = 2
# construct ansatz
from qiskit.circuit.library import RealAmplitudes
ansatz = RealAmplitudes(feature_dimension, reps=reps)
# Build QNN
from qiskit_machine_learning.neural_networks import SamplerQNN
sampler_qnn = SamplerQNN(
circuit=qc,
input_params=feature_map.parameters,
weight_params=ansatz.parameters,
output_shape=number_classes,
sampler=sampler
)
# construct classifier
from qiskit_machine_learning.algorithms.classifiers import NeuralNetworkClassifier
from qiskit.algorithms.optimizers import COBYLA
sampler_classifier = NeuralNetworkClassifier(
neural_network=sampler_qnn, optimizer=COBYLA(maxiter=30), callback=callback_graph
)
# create empty array for callback to store evaluations of the objective function
objective_func_vals = []
# score classifier
sampler_classifier.score(X_train, y_train)
# callback function that draws a live plot when the .fit() method is called
def callback_graph(weights, obj_func_eval):
    clear_output(wait=True)
    objective_func_vals.append(obj_func_eval)
    plt.title("Objective function value against iteration")
    plt.xlabel("Iteration")
    plt.ylabel("Objective function value")
    plt.plot(range(len(objective_func_vals)), objective_func_vals)
    plt.show()
# plot results
# red == wrongly classified
for x, y_target, y_p in zip(X_test, y_test, y_predict):
    if y_target == 1:
        plt.plot(x[0], x[1], "bo")
    else:
        plt.plot(x[0], x[1], "go")
    if y_target != y_p:
        plt.scatter(x[0], x[1], s=200, facecolors="none", edgecolors="r", linewidths=2)
plt.plot([-1, 1], [1, -1], "--", color="black")
plt.show()
metrics_dataframe
Output:
(Figure: the objective function value plotted against iteration for the SamplerQNN classifier, followed by the scatter plot of the test points, with wrongly classified points circled in red.)
Classification Report:
precision recall f1-score support
accuracy 0.85 13
macro avg 0.85 0.85 0.85 13
weighted avg 0.85 0.85 0.85 13
circuit_classifier
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Import dataset
data = '../data/datasets/neurons_binary.csv'
#data = '../data/datasets/neurons.csv'
neuron = pd.read_csv(data, delimiter=';')
df = neuron  # work on the full dataset (df is used below)
# Divide the data, y the variable to predict (Target) and X the features
X = df[df.columns[1:]].to_numpy()  # We remove labels and convert the pandas DataFrame into a NumPy array
y = df['Target'].to_numpy()
X_train = Normalize.fit_transform(X_train)
X_test = Normalize.fit_transform(X_test)
# Variable definition
number_classes = 2 # Number of classes
# Variable definition
feature_dimension = 2 # Number of qubits
quantum_backend = None # We use local simulator
reps = 2
# Create empty array for callback to store evaluations of the objective function
objective_func_vals = []
# score classifier
vqc.score(X_train, y_train)
# plot results
# red == wrongly classified
for x, y_target, y_p in zip(X_test, y_test, y_predict):
    if y_target[0] == 1:
        plt.plot(x[0], x[1], "bo")
    else:
        plt.plot(x[0], x[1], "go")
    if not np.all(y_target == y_p):
        plt.scatter(x[0], x[1], s=200, facecolors="none", edgecolors="r", linewidths=2)
plt.plot([-1, 1], [1, -1], "--", color="black")
plt.show()
print(vqc.score(X_train, y_train))
Output:
(Figure: the objective function value plotted against iteration for the VQC, followed by the scatter plot of the test points, with wrongly classified points circled in red.)
0.9655172413793104
5.5.4 Regression
Regression can also be performed with similar methodology to VQC, EstimatorQNN, or SamplerQNN.
For example, we can use VQR instead, which is a special variant of the NeuralNetworkRegressor with an EstimatorQNN.
It will consider the l2 loss function to minimize the mean squared error between targets and predictions.
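Written out, the objective minimized over the N training pairs (x_i, y_i) is the usual mean squared error:
$$L(\theta) = \frac{1}{N}\sum_{i=1}^{N}\bigl(f_\theta(x_i) - y_i\bigr)^2$$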
In regression, we will use lines of code such as the following:
# construct QNN
regression_estimator_qnn = EstimatorQNN(
circuit=qc, input_params=feature_map.parameters, weight_params=ansatz.parameters
)
# fit to data
regressor.fit(X, y)
vqr = VQR(
feature_map=feature_map,
ansatz=ansatz,
optimizer=L_BFGS_B(maxiter=5),
callback=callback_graph,
)
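For completeness, a small end-to-end sketch of a VQR fit on synthetic one-dimensional data (illustrative only; the feature map, ansatz, and data are assumptions, not the book's example):
import numpy as np
from qiskit.circuit.library import ZFeatureMap, RealAmplitudes
from qiskit.algorithms.optimizers import L_BFGS_B
from qiskit_machine_learning.algorithms import VQR

# Synthetic 1D regression data: y = sin(pi * x) with a little noise
X = np.linspace(-1, 1, 20).reshape(-1, 1)
y = np.sin(np.pi * X[:, 0]) + 0.1 * np.random.default_rng(0).normal(size=20)

feature_map = ZFeatureMap(feature_dimension=1)
ansatz = RealAmplitudes(num_qubits=1, reps=2)

vqr = VQR(feature_map=feature_map, ansatz=ansatz, optimizer=L_BFGS_B(maxiter=5))
vqr.fit(X, y)             # minimizes the squared-error (l2) loss
print(vqr.score(X, y))    # R^2 score on the training data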
5.6 Quantum Generative Adversarial Network
To produce new information from a given dataset, one possibility is to employ a quantum generative adversarial network
(qGAN) combined with a hybrid quantum-classical algorithm specifically created for generative modeling tasks, as
presented by Zoufal et al. (2019). This method takes advantage of the interplay between a quantum generator Gθ (also
known as an ansatz) and a classical discriminator Dϕ (a type of neural network) to grasp the fundamental probability
distribution of the supplied training data. The training process for the generator and discriminator involves a series
of alternating optimization steps. The generator’s aim is to generate samples that the discriminator will identify as
authentic training data samples (i.e., samples originating from the actual training distribution), while the discriminator
endeavors to differentiate between genuine training data samples and those produced by the generator (essentially, dis-
tinguishing real and generated distributions). The ultimate objective is for the quantum generator to capture a represen-
tation of the basic probability distribution of the training data. As a result, the trained quantum generator can be
employed to load a quantum state, which acts as an approximate model of the target distribution. The qGAN is a method
used to learn the fundamental probability distribution of a collection of k-dimensional data samples and load it directly
into a quantum state:
$$|g_\theta\rangle = \sum_{j=0}^{2^n-1} \sqrt{p_j^{\theta}}\,|j\rangle$$
The qGAN training produces a state |g_θ⟩ that encodes the occurrence probabilities p_j^θ of the basis states |j⟩ (as amplitudes √(p_j^θ)), with j ∈ {0, …, 2^n − 1}. The goal is to generate a probability distribution that closely resembles the distribution underpinning the training data X = {x^0, …, x^{k−1}}. To execute this algorithm, several steps must be followed in order to generate data using qGANs. To simplify the representation, the samples are converted into discrete values. The number of discrete values that can be represented depends on the quantity of qubits employed for mapping. As a result, the data's resolution is determined by the number of qubits utilized. For example, if 3 qubits are used to represent one feature, 2^3 = 8 discrete values
can be generated. In the example provided below, the training data was discretized using an array [5, 5], which indicates the
number of qubits used to represent each data dimension, converted to PyTorch modeling, starting with the transformation
of data arrays into tensors, and then creating a data loader from the training data. Following this, a backend (quantum
hardware or simulator) is selected to operate the quantum generator. A quantum instance is then established for evaluation
purposes, choosing 10,000 shots to obtain more comprehensive insights, and launching the parameterized quantum circuit G(θ), where θ = (θ_1, …, θ_k), which will be used in the quantum generator. To implement the quantum generator, a depth-2 ansatz
is chosen that employs RY rotations and CX gates and accepts a uniform distribution as input state. It is crucial to recognize
that for k > 1, the generator’s parameters must be selected carefully. For instance, the circuit depth should be more than 1 to
enable the representation of more complex structures. Next, a function is created that generates the quantum generator
using a specified parameterized quantum circuit. This function takes a quantum instance for data sampling as its para-
meters. The TorchConnector is utilized to encapsulate the created quantum neural network, facilitating PyTorch-based
training. A classical neural network, representing the classical discriminator, is then constructed using PyTorch. The under-
lying gradients can be computed automatically through PyTorch’s built-in features. Both the generator and the discrimi-
nator are trained using binary cross-entropy as the loss function:
$$L(\theta) = \sum_j p_j(\theta)\,\bigl[\,y_j \log(x_j) + (1 - y_j)\log(1 - x_j)\,\bigr]$$
$$\frac{\partial L(\theta)}{\partial \theta_l} = \sum_j \frac{\partial p_j(\theta)}{\partial \theta_l}\,\bigl[\,y_j \log(x_j) + (1 - y_j)\log(1 - x_j)\,\bigr]$$
We determine the relative entropy between the target and trained distributions, which serves as a measure of distance
between probability distributions, to evaluate the closeness or divergence between the trained and target distributions. The
ADAM optimizer can be employed to train both the generator and the discriminator.
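For reference, the relative entropy (Kullback–Leibler divergence) between two discrete distributions p and q has the standard form:
$$D_{\mathrm{KL}}(p\,\|\,q) = \sum_j p_j \log\frac{p_j}{q_j}$$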
It is time to start programming. We will be working with the following dataset, available on GitHub at https://github.com/xaviervasques/hephaistos/blob/main/data/datasets/e-type.csv.
The Jupyter Notebook of the code below can be found at https://github.com/xaviervasques/hephaistos/blob/main/Notebooks/q_GAN_Pipeline.ipynb.
Input:
# Set the random seed for PyTorch to ensure reproducibility of the generated random numbers.
torch.manual_seed(42)
# Set the random seed for Qiskit's algorithm_globals to ensure reproducibility of the random numbers used in Qiskit algorithms.
algorithm_globals.random_seed = 42
# Define the file path for the dataset as a string variable named 'mtype'.
mtype = '../data/datasets/e-type.csv'
# Load the dataset using pandas 'read_csv', specifying the delimiter as ';', and store it in a DataFrame named 'df'.
df = pd.read_csv(mtype, delimiter=';')
# Display the first 5 rows of the DataFrame 'df' using the 'head' function.
df.head()
Output:
Target adaptation avg_isi electrode_0_pa f_i_curve_slope fast_trough_t_long_square fast_trough_t_ramp fast_trough_t_short_square fast_trough_v_long_s
5 rows × 49 columns
Input:
# Apply the 'fit_transform' method of the 'labelencoder' object to the 'Target' column of the DataFrame 'df',
# converting the categorical values to numerical values, and store the result back in the 'Target' column.
df['Target'] = labelencoder.fit_transform(df['Target'])
# Display the first 5 rows of the DataFrame 'df' using the 'head' function to show the updated 'Target' column.
df.head()
Output:
Target adaptation avg_isi electrode_0_pa f_i_curve_slope fast_trough_t_long_square fast_trough_t_ramp fast_trough_t_short_square fast_trough_v_long_s
5 rows × 49 columns
Input:
# Import the KNNImputer class from the sklearn.impute module for handling missing values using k-Nearest Neighbors.
from sklearn.impute import KNNImputer
# Instantiate a KNNImputer object with 5 nearest neighbors and store it in the variable 'KNN_imputer'.
KNN_imputer = KNNImputer(n_neighbors=5)
# Apply the 'fit_transform' method of the 'KNN_imputer' object to the entire DataFrame 'df',
# imputing missing values based on the 5 nearest neighbors, and store the results back in 'df'.
df.iloc[:, :] = KNN_imputer.fit_transform(df)
# Display the first 5 rows of the DataFrame 'df' using the 'head' function to show the imputed values.
df.head()
# Select only one class
# Set the type_number variable to the class label you want to filter for
type_number = 0
# Filter the df_train DataFrame to include only instances where the 'Target' column
# is equal to the specified type_number
df = df[df['Target'] == type_number]
# Number of classes (starts from 0)
# Determine the number of unique classes in the 'Target' column by using the 'groupby' method
# followed by 'size' and 'index', and store the result in the variable 'number_classes'.
number_classes = len(df.groupby('Target').size().index)
# Divide the data, y the variable to predict (Target) and X the features
# Assign all columns except the 'Target' column from the DataFrame 'df' to the variable 'X'.
X = df[df.columns[1:]]
# Assign the 'Target' column from the DataFrame 'df' to the variable 'y'.
y = df['Target']
# Split the data into training and testing sets using 'train_test_split' with a test size of 0.5 (50%),
# stratifying the split based on 'y' to ensure a balanced distribution of classes,
# and set the random state to 42 for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, stratify=y, random_state=42)
X_train
Output:
adaptation avg_isi electrode_0_pa f_i_curve_slope fast_trough_t_long_square fast_trough_t_ramp fast_trough_t_short_square fast_trough_v_long_squ
24 rows × 48 columns
Input:
# Fit the NCA model to the feature data (X_train) and the target variable (y_train),
# and transform the data using the learned model.
# The transformed data (training_data) will have 2 components, as specified by n_components.
training_data = extraction.fit(X_train, y_train).transform(X_train)
Output:
[ [ 1.18632791 –3.36745199]
[ 0.68146915 –0.7093939 ]
[ 0.24404409 –0.78579509]
[–1.42294384 5.49353715]
[–1.28454583 1.40429752]
[–0.57685353 –0.34843588]
[–0.40164212 0.10773078]
[13.65363451 1.93219055]
[ 6.57909835 –3.09613086]
[ 0.45864905 –2.99663493]
[14.43916508 1.37774594]
[–1.65068257 2.86286648]
[–2.10741473 3.93706039]
[ 0.72275431 –1.66833441]
[ 1.07689893 –0.31745944]
[ 0.37896589 –2.47945056]
[–0.17086296 –2.33438325]
[–2.18486917 3.46464634]
[–0.06455965 4.93661457]
[–1.0877558 3.32550378]
[ 0.05539514 –2.20910866]
[17.13972368 2.15327072]
[ 0.27363331 –1.69303335]
[–0.66780033 1.23111316] ]
Input:
# Determine data resolution for each dimension of the training data in terms
# of the number of qubits used to represent each data dimension
data_dim = [5, 5]
Input:
# Print training_data
training_data
Output:
array ( [ [ 0.59984354, –0.79698094] ,
[ 0.07209385, –0.79698094] ,
[–1.51115522, 1.48724458] ,
[–0.45565584, –0.28937527] ,
[–0.45565584, 0.2182304 ] ,
[13.7935858, 1.99485025] ,
[ 6.40509014, –3.08120647] ,
[ 0.59984354, –3.08120647] ,
[14.32133549, 1.48724458] ,
[–1.51115522, 2.75625876] ,
[–2.03890491, 4.02527294] ,
[ 0.59984354, –1.55838945] ,
[ 1.12759323, –0.28937527] ,
[ 0.59984354, –2.5736008 ] ,
[ 0.07209385, –2.31979796] ,
[–2.03890491, 3.51766727] ,
[–0.98340553, 3.26386443] ,
[ 0.07209385, –2.31979796] ,
[ 0.07209385, –1.81219229] ,
[–0.45565584, 1.23344174] ] )
Input:
# Convert training_data and grid_elements to PyTorch tensors with a float data type
training_data = torch.tensor(training_data, dtype=torch.float)
grid_elements = torch.tensor(grid_elements, dtype=torch.float)
Output:
(Figure: histograms of the first and second variables of the transformed training data.)
Input:
# Save and load the IBMQ account with the provided API key
IBMQ.save_account('YOUR API', overwrite=True)
IBMQ.load_account()
# Get the provider and the backend for the quantum simulations
provider = IBMQ.get_provider(hub='YOUR PROVIDER')
backend = provider.get_backend('YOUR BACKEND')
# Create QuantumInstance objects for training and sampling, setting the number of shots
qi_training = QuantumInstance(backend, shots=batch_size)
qi_sampling = QuantumInstance(backend, shots=10000)
# Assuming data_dim is already defined as a list of data resolutions for each dimension
# sum(data_dim) corresponds to the total number of qubits in our quantum circuit (qc)
qc = QuantumCircuit(sum(data_dim))
# Compose the quantum circuit (qc) with the TwoLocal object (twolocal)
qc.compose(twolocal, inplace=True)
# Draw the decomposed quantum circuit using the matplotlib (mpl) backend
qc.decompose().draw("mpl")
Output:
(Output: the decomposed generator circuit on qubits q0 through q9. Each qubit begins with a U2(0, π) gate (equivalent to a Hadamard, preparing the uniform input distribution), followed by three layers of parameterized RY rotations with angles θ[0] through θ[29], interleaved with the entangling gates of the depth-2 TwoLocal ansatz. The full diagram is not reproduced here.)
Input:
return TorchConnector(circuit_qnn)
loss_grad = ()
for j, grad in enumerate(grads):
cx = grad[0].tocoo()
input = torch.zeros(len(cx.col), len(data_dim))
target = torch.ones(len(cx.col), 1)
weight = torch.zeros(len(cx.col), 1)
# Define a function to calculate the relative entropy between generated data and true data
def get_relative_entropy(gen_data) -> float:
    prob_gen = np.zeros(len(grid_elements))
    for j, item in enumerate(grid_elements):
        for gen_item in gen_data.detach().numpy():
            if np.allclose(np.round(gen_item, 6), np.round(item, 6), rtol=1e-5):
                prob_gen[j] += 1
    prob_gen = prob_gen / len(gen_data)
    prob_gen = [1e-8 if x == 0 else x for x in prob_gen]
    return entropy(prob_gen, prob_data)
clear_output(wait=True)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(18, 6))
# Loss
ax1.set_title("Loss")
ax1.plot(generator_loss_values, label="generator loss", color="royalblue")
ax1.plot(discriminator_loss_values, label="discriminator loss", color="magenta")
ax1.legend(loc="best")
ax1.set_xlabel("Iteration")
ax1.set_ylabel("Loss")
ax1.grid()
# Relative Entropy
ax2.set_title("Relative entropy")
ax2.plot(relative_entropy_values)
ax2.set_xlabel("Iteration")
ax2.set_ylabel("Relative entropy")
ax2.grid()
plt_name = 'Loss_Relative_Entropy_%i.png'%type_number
fig
# Initialize lists for storing relative entropy, generator loss, and discriminator loss values
relative_entropy_values = []
generator_loss_values = []
discriminator_loss_values = []
# Training loop
for epoch in range(num_epochs):
    relative_entropy_epoch = []
    generator_loss_epoch = []
    discriminator_loss_epoch = []
    for i, data in enumerate(dataloader):
        # Adversarial ground truths
        valid = torch.ones(data.size(0), 1)
        fake = torch.zeros(data.size(0), 1)
        # Train Discriminator (the loss computation itself is elided in this excerpt)
        optimizer_disc.zero_grad()
        discriminator_loss.backward(retain_graph=True)
        optimizer_disc.step()
        # Train Generator
        optimizer_gen.zero_grad()
        # generator_loss.backward(retain_graph=True)
        for j, param in enumerate(generator.parameters()):
            param.grad = g_loss_grad
        optimizer_gen.step()
        generator_loss_epoch.append(generator_loss.item())
        discriminator_loss_epoch.append(discriminator_loss.item())
    relative_entropy_values.append(np.mean(relative_entropy_epoch))
    generator_loss_values.append(np.mean(generator_loss_epoch))
    discriminator_loss_values.append(np.mean(discriminator_loss_epoch))
Output:
(Figure: the left panel plots the generator and discriminator losses against iteration over 100 iterations, both settling between roughly 0.6 and 1.2; the right panel plots the relative entropy, which decreases from about 21 to about 16 over the 100 iterations.)
Input:
# Plot the cumulative distribution function for generated and training data
fig = plt.figure(figsize=(12, 12))
ax1 = fig.add_subplot(111, projection="3d")
ax1.set_title("Cumulative Distribution Function")
ax1.bar3d(
np.transpose(grid_elements)[1],
np.transpose(grid_elements)[0],
np.zeros(len(prob_gen)),
0.05,
0.05,
np.cumsum(prob_gen),
label="generated data",
color="blue",
alpha=1,
)
ax1.bar3d(
np.transpose(grid_elements)[1] + 0.05,
np.transpose(grid_elements)[0] + 0.05,
np.zeros(len(prob_data)),
0.05,
0.05,
np.cumsum(prob_data),
label="training data",
color="orange",
alpha=1,
)
ax1.set_xlabel("x_1")
ax1.set_ylabel("x_0")
ax1.set_zlabel("p(x)")
plt_2_name ='Cumulative_Distribution_Function_%i.png'%type_number
#plt.savefig(plt_2_name)
fig
Output:
(Figure: a 3D bar plot of the cumulative distribution function p(x) over the (x_0, x_1) grid, comparing the generated data (blue) with the training data (orange).)
Input:
# Create a DataFrame with the back-scaled data and the column names
data = pd.DataFrame(X_orig_backscaled, columns=[
'adaptation',
'avg_isi',
'electrode_0_pa',
'f_i_curve_slope',
'fast_trough_t_long_square',
'fast_trough_t_ramp',
'fast_trough_t_short_square',
'fast_trough_v_long_square',
'fast_trough_v_ramp',
'fast_trough_v_short_square',
'input_resistance_mohm',
'latency',
'peak_t_long_square',
'peak_t_ramp',
'peak_t_short_square',
'peak_v_long_square',
'peak_v_ramp',
'peak_v_short_square',
'ri',
'sag',
'seal_gohm',
'slow_trough_t_long_square',
'slow_trough_t_ramp',
'slow_trough_t_short_square',
'slow_trough_v_long_square',
'slow_trough_v_ramp',
'slow_trough_v_short_square',
'tau',
'threshold_i_long_square',
'threshold_i_ramp',
'threshold_i_short_square',
'threshold_t_long_square',
'threshold_t_ramp',
'threshold_t_short_square',
'threshold_v_long_square',
'threshold_v_ramp',
'threshold_v_short_square',
'trough_t_long_square',
'trough_t_ramp',
'trough_t_short_square',
'trough_v_long_square',
'trough_v_ramp',
'trough_v_short_square',
'upstroke_downstroke_ratio_long_square',
'upstroke_downstroke_ratio_ramp',
'upstroke_downstroke_ratio_short_square',
'vm_for_sag',
'vrest'
])
A CSV file called data_gan_0.csv will be created, containing the freshly generated data.
To execute the following example, create a Python file within the hephAIstos directory and proceed to run it. Quantum
machine learning algorithms and processes such as those described above and including q_kernel_default, q_kernel_8,
q_kernel_9, q_kernel_10, q_kernel_11, q_kernel_12, q_kernel_zz, q_kernel_training, q_kernel_8_pegasos, q_kernel_9_pe-
gasos, q_kernel_10_pegasos, q_kernel_11_pegasos, q_kernel_12_pegasos, q_circuitqnn, q_twolayerqnn, and q_vqc can be
easily handled using hephAIstos. For q_kernel_training, we need to adapt several parameters that are in the code such as
the setup of the optimizer:
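For example, the optimizer setup might look like the following sketch (SPSA is a common choice for kernel alignment; the values are illustrative):
from qiskit.algorithms.optimizers import SPSA

optimizer = SPSA(maxiter=10, learning_rate=0.05, perturbation=0.05)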
The rotational layer for training and the number of circuits must also be specified:
# Create a rotational layer to train. We will rotate each qubit the same amount.
training_params = ParameterVector("θ", 1)
fm0 = QuantumCircuit(feature_dimension)
for qubit in range(feature_dimension):
fm0.ry(training_params[0], qubit)
Let us view some examples. We will run the exact procedure we have run above with the q_kernel_zz algorithm, which
includes data rescaling with the quantile uniform technique, label encoding of the “Target” column, repetition (the number
of times the feature map circuit is repeated) set to 2, the use of the statevector simulator, the selection of five features using
embedded decision tree, the use of five qubits, and a fivefold cross-validation to measure the quality of our model.
Input:
#!/usr/bin/python3
from ml_pipeline_function import ml_pipeline_function
import pandas as pd
scale = ['quantile_uniform']
selection = ['embedded_decision_tree_classifier']
kernel = ['q_kernel_zz']
The above code will produce the same results as previously obtained. We could also add the multiclass option (multi-
class = ‘OneVsRestClassifier’ or ‘OneVsOneClassifier’ or ‘SVC’) if we want to pass our quantum kernel
to SVC from scikit-learn or “None” if we want to use QSVC from Qiskit.
For pegasos algorithms, we need to add the following options:
If we want to create a pipeline to benchmark different quantum algorithms with a combination of feature rescaling and
feature selection techniques, we can use the following code:
#!/usr/bin/python3
from ml_pipeline_function import ml_pipeline_function
import pandas as pd
# Dataset
from data.datasets import neurons_maha_soma
neuron = neurons_maha_soma()
scale = [
'standard_scaler',
'minmax_scaler',
'maxabs_scaler',
'robust_scaler',
'normalizer',
'log_transformation',
'square_root_transformation',
'reciprocal_transformation',
'box_cox',
'yeo_johnson',
'quantile_gaussian',
'quantile_uniform',
]
selection = [
'variance_threshold',
'chi_square',
'anova_f_c',
'pearson',
'forward_stepwise',
'backward_elimination',
'exhaustive',
'lasso',
'feat_reg_ml',
'embedded_linear_regression',
'embedded_logistic_regression',
'embedded_random_forest_classifier',
'embedded_decision_tree_classifier',
'embedded_xgboost_classification',
]
If we want to create a pipeline to benchmark different quantum algorithms with a combination of feature rescaling and
feature extraction techniques, we can use the following code:
#!/usr/bin/python3
from ml_pipeline_function import ml_pipeline_function
import pandas as pd
# Dataset
from data.datasets import neurons_maha_soma
neuron = neurons_maha_soma()
scale = [
'standard_scaler',
'minmax_scaler',
'maxabs_scaler',
'robust_scaler',
'normalizer',
'log_transformation',
'square_root_transformation',
'reciprocal_transformation',
'box_cox',
'yeo_johnson',
'quantile_gaussian',
'quantile_uniform',
]
extraction = [
'pca',
'ica',
'icawithpca',
'lda_extraction',
'random_projection',
'truncatedSVD',
'isomap',
'standard_lle',
'modified_lle',
'hessian_lle',
'ltsa_lle',
'mds',
'spectral',
'tsne',
'nca'
]
References
Blacoe, W., Kashefi, E., and Lapata, M. (2013). A quantum-theoretic approach to distributional semantics. In: Proceedings of the
2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies,
pp. 847–857. Atlanta, Georgia: Association for Computational Linguistics.
Ezhov, A. and Ventura, D. (2000). Quantum neural networks. In: Future Directions for Intelligent Systems and Information Sciences
(ed. N. Kasabov), pp. 213–235.
Glick, J.R., Gujarati, T.P., Córcoles, A.D. et al. (2021). Covariant quantum kernels for data with group structure. https://arxiv.org/pdf/2105.03406.pdf.
Grover, L.K. (1996). A fast quantum mechanical algorithm for database search. In: Proceedings of the Twenty-Eighth Annual ACM
Symposium on Theory of Computing, pp. 212–219.
Havlíček, V., Córcoles, A.D., Temme, K. et al. (2019). Supervised learning with quantum-enhanced feature spaces. Nature 567
(7747): 209–212.
Liu, Z., Liu, Y., Yang, Y. et al. (2017). Subthalamic nuclei stimulation in patients with pantothenate kinase-associated
neurodegeneration (PKAN). Neuromodulation 20 (5): 484–491.
Lloyd, S., Mohseni, M., and Rebentrost, P. (2013). Quantum algorithms for supervised and unsupervised machine learning.
https://arxiv.org/abs/1307.0411.
Rebentrost, P., Mohseni, M., and Lloyd, S. (2013). Quantum support vector machine for big data classification. Physical Review
Letters, American Physical Society 113 (13): (revised in 2014). https://doi.org/10.1103/physrevlett.113.130503.
Ricks, B. and Ventura, D. (2003). Training a quantum neural network. Neural Information Processing Systems 1019–1026.
Shor, P.W. (1994). Algorithms for quantum computation: discrete logarithms and factoring. In: Proceedings of the 35th Annual Symposium on Foundations of Computer Science, IEEE, pp. 124–134.
Further Reading
Biamonte, J., Wittek, P., Pancotti, N. et al. (2018). Quantum machine learning. Nature 549 (7671): 195–202.
Bishwas, A.K., Mani, A., and Palade, V. (2018). An all-pair quantum SVM approach for big data multiclass classification. Quantum
Information Processing 17: 282.
Coles, P.J. (2021). Seeking quantum advantage for neural networks. Nature Computational Science 1: 389–390.
Farhi, E. and Neven, H. (2018). Classification with quantum neural networks on near term processors. https://arxiv.org/abs/
1802.06002.
Schuld, M. (2021). Supervised quantum machine learning models are kernel methods. https://arxiv.org/abs/2101.11020.
Suzuki, Y., Yano, H., Gao, Q. et al. (2020). Analysis and synthesis of feature map for kernel-based quantum classifier. https://arxiv.org/abs/1906.10467.
Schuld, M. and Killoran, N. (2018). Quantum machine learning in feature Hilbert spaces. Physical Review Letters 122: 040504.
Vapnik, V.N. (2000). The Nature of Statistical Learning Theory. Book series: Information Science and Statistics (ISS). Springer.
Zoufal, C., Lucchi, A., and Woerner, S. (2019). Quantum generative adversarial networks for learning and loading random
distributions. NPJ Quantum Information 5 (1): 103.
6 Machine Learning in Production
How many of the artificial intelligence (AI) models that have been created have actually been put into production? With investment in data science teams and technologies, the number of AI projects has increased significantly, and with it the number of missed opportunities to put them into production and assess their true business value. The goal of this chapter is to provide a brief
introduction to several technologies that can help bring our AI models to life.
6.1 Why Use Docker Containers for Machine Learning?
Even though there are different containerization technologies, we will choose Docker to explain why containerization of
machine learning applications is important. Docker is an open-source technology that allows packaging of applications into
containers.
Each service or feature of the application is isolated in a way that we can scale or update without impacting other application
features. To put machine learning into production, let us consider that the application needs to be broken down into smaller
microservices such as ingestion, preparation, combination, separation, training, evaluation, inference, postprocessing, and
monitoring.
6.1.2 Containerization
Microservice architecture also has its drawbacks. When developing a machine learning application in one server, we
will require the same number of virtual machines (VMs) as microservices containing dependencies. Each VM will need
an operating system (OS), libraries, and binaries and will consume more hardware resources such as processor, memory,
and disk space even if the microservice is not running. This is where Docker comes in. If a container is not running, the
remaining resources become shared resources and accessible to other containers. We do not need to add an OS in a con-
tainer. Let us consider an entire solution composed of applications 1 and 2 (APP 1 and APP 2, respectively). If we want to scale out APP 1 or add other applications, as shown in the scheme below, the available resources can limit us when we use VMs instead of containers. If we decide to scale out only APP 1 and not APP 2 (keeping a single instance of APP 2), the resources APP 2 does not use remain shared and available to all container processes.
(Diagram: scaling out APP 1 on virtual machines versus Docker containers. With VMs, each application instance requires its own VM (VM1, VM2, VM3) on top of the host OS and hardware; with Docker, all containers running instances of APP 1 share the host OS and hardware, reducing hardware resource consumption.)
6.1.3 Docker and Machine Learning: Resolving the “It Works in My Machine” Problem
Creating a machine learning model that works on our computer is not really complicated. But when we work, for example,
with a customer who wants to use a model that can scale and function in all types of servers across the globe, it is more
challenging. After developing our model, it might run perfectly well on our laptop or server but not really on other systems
such as when we move the model to the production stage or another server. Many problems can occur, including
performance issues, application crashes, or poor optimization. The other challenging situation is that our machine learning
model can certainly be written with one single programming language such as Python, but the application will also certainly
need to interact with other applications written in other programming languages for data ingestion, data preparation,
front-end, and so on. Docker allows better management of all these interactions as each microservice can be written in
a different language, allowing scalability and the straightforward addition or deletion of independent services. Docker
provides reproducibility, portability, easy deployment, granular updates, lightness, and simplicity.
When a model is ready, the data scientist’s worry is that the model will not reproduce the results of real life. Sometimes, it
is not because of the model but rather the need to reproduce the whole stack. Docker allows easy reproduction of the
working environment that can be used to train and run the machine learning model anywhere. Docker allows packaging
of code and dependencies into containers that can be ported to different servers even with different hardware or OSs.
A training model can be developed on a local machine and easily ported to external clusters with additional resources such
as graphics processing units (GPUs), more memory, or powerful central processing units (CPUs). It is easy to deploy and
make a model available to the globe by wrapping it into an application programming interface (API) in a container and
deploying the container using technology such as OpenShift, a Kubernetes distribution. This simplicity is also a good
argument in favor of the containerization of machine learning applications, as we can automatically create containers with
templates and have access to an open-source registry containing existing user-contributed containers. Docker allows devel-
opers to track the different versions of a container image, check who built a version with what platform, and roll back to
previous versions. Finally, another argument is that a machine learning application can continue running even if one of its
services is updating, repairing, or down. For example, to update an output message that is embedded in the entire solution,
there is no need to update the entire application and to interfere with other services.
brew update
brew install docker
Then, we need to install the docker-machine and VirtualBox dependencies because Docker uses a Linux environment
natively:
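The commands are likely along these lines (a sketch; exact package names can vary with the Homebrew version):
brew install docker-machine
brew install --cask virtualbox
We can then check that the installation succeeded: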
docker --version
For CentOS or Red Hat, we go to https://download.docker.com/linux/centos and choose our version of CentOS. Then,
we browse to x86_64/stable/Packages/ and download the .rpm file for the Docker version we wish to install:
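For example (the path is a placeholder for the downloaded file):
sudo yum install /path/to/package.rpm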
We do not need to have Python installed, as Docker will automatically download the images from the Docker hub registry
at https://hub.docker.com.
Of course, we can run any version of Python, such as 2.7:
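A command along the following lines does this (illustrative, consistent with the flags described next):
docker run --rm -ti python:2.7 python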
The command tells Docker to run a new container from the python:2.7 image. The –rm flag tells Docker to remove the
container once we stop the process (ctrl+d). The -ti flag means that we can interact with the container from our terminal.
To see what Docker images are present on our computer, we can run the following from the command line:
docker images
Python alone may not be sufficient to create an environment to perform more complex tasks. For example, if we need an
interactive environment such as Jupyter Notebooks, we can use an available public image. The command to run Jupyter
Notebooks is the following:
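For example (illustrative; the image and port match the description below):
docker run --rm -p 8888:8888 jupyter/scipy-notebook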
We can access the Jupyter Notebook by copying and pasting one of the output URLs (the token from the command line).
The -p flag tells Docker to open a port from the container to a port on the host machine. Port 8888 on the left side is the port
number on the host machine, and port 8888 on the right side is the port in the container.
6.1.5 Dockerfile
All of the above is useful, but running bash in a container and installing what we need would be painful to replicate.
Therefore, let us get our code into a container and create our process by building a custom Docker image with a Dockerfile,
which tells Docker what we need in our application to run. Dockerfile is a text-based script of instructions.
To create a Dockerfile, we open a file named Dockerfile in our working environment. Once created, we use a
simple syntax such as the following:
FROM      Specifies the base image. We can browse base images on Docker Hub.
WORKDIR   Creates (if necessary) the specified directory within the image and changes to it.
RUN       Runs a command in a terminal inside the image.
ADD       Specifies the files to add from our directory to a directory in the image (created if it does not exist).
EXPOSE    Opens a specific port number.
CMD       Takes the argument for running an application.
ENV       Specifies environment variables specific to services.
COPY      Adds files from the build context to the image.
FROM jupyter/scipy-notebook
# Copy all the necessary files to the image (COPY <src> … <dest>)
COPY train.py ./train.py
COPY inference.py ./inference.py
Here, we start with the jupyter/scipy-notebook image, which is a Jupyter Notebook scientific Python stack including
popular packages from the scientific Python ecosystem such as popular Python deep learning libraries. We run pip install
joblib to install joblib. We then check our Python environment and set the working directory for containers. We copy the
train.py and inference.py scripts into the image. Then, we run the scripts. We could also run the scripts as follows:
FROM python:3.7
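# A sketch of how such a Dockerfile might continue (illustrative; the exact lines are not reproduced in this excerpt)
RUN pip install numpy pandas scikit-learn joblib
COPY train.py inference.py ./
RUN python3 train.py
CMD ["python3", "inference.py"]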
6.1.6 Build and Run a Docker Container for Your Machine Learning Model
The idea of this section is to perform a rapid and easy build of a Docker container with a simple machine learning model and
then run it. To start building a Docker container for a machine learning model, let us consider three files:
• Dockerfile
• train.py
• inference.py
All files can be found on GitHub (https://github.com/xaviervasques/EEG-letters). The file train.py is a Python script
that ingests and normalizes EEG data in a .csv file (train.csv) and trains two models to classify the data (using scikit-learn).
The script saves two models: linear discriminant analysis (LDA) (clf_lda) and a neural network multilayer perceptron (clf_NN).
#!/usr/bin/python3
# train.py
# Xavier Vasques 13/04/2021
import os
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neural_network import MLPClassifier
import pandas as pd
from joblib import dump
from sklearn import preprocessing
def train():
# Models training
# Save model
from joblib import dump
dump(clf_lda, 'Inference_lda.joblib')
# Save model
from joblib import dump, load
dump(clf_NN, 'Inference_NN.joblib')
if __name__ == '__main__':
train()
The inference.py script will be called to perform batch inference by loading the two models that have been created. The
application will normalize new EEG data coming from a .csv file (test.csv), perform inference on the dataset, and print the
classification accuracy and predictions:
#!/usr/bin/python3
# inference.py
# Xavier Vasques 13/04/2021

import os
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neural_network import MLPClassifier
import pandas as pd
from joblib import load
from sklearn import preprocessing

def inference():
    # Load and normalize the new data (test.csv); the column layout and scaler assumed
    # here mirror those used in train.py
    data = pd.read_csv('test.csv')
    y_test = data['# Letter'].values
    X_test = preprocessing.StandardScaler().fit_transform(
        data.drop(data.loc[:, 'Line':'# Letter'].columns, axis=1))
    # Run model
    clf_lda = load('Inference_lda.joblib')
    print("LDA score and classification:")
    print(clf_lda.score(X_test, y_test))
    print(clf_lda.predict(X_test))
    # Run model
    clf_nn = load('Inference_NN.joblib')
    print("NN score and classification:")
    print(clf_nn.score(X_test, y_test))
    print(clf_nn.predict(X_test))

if __name__ == '__main__':
    inference()
Let us create a simple Dockerfile with the jupyter/scipy-notebook image as our base image. We need to install joblib to allow serialization and deserialization of our trained models. We copy the train.csv, test.csv, train.py, and inference.py files into the image. Then, we run train.py, which will fit and serialize the machine learning models as part of our image build process. This provides several advantages, such as the ability to debug early in the process, to use the Docker image ID for tracking, or to keep different versions.
FROM jupyter/scipy-notebook
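Based on the description above, the remaining lines of this Dockerfile are similar to the following sketch (the exact original contents may differ):
RUN pip install joblib
COPY train.csv ./train.csv
COPY test.csv ./test.csv
COPY train.py ./train.py
COPY inference.py ./inference.py
RUN python3 train.py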
We can do some other things to improve our containerization experience. For example, we can set a working directory in the image using WORKDIR in the Dockerfile and bind a host directory to it when we run the container:
FROM jupyter/scipy-notebook
WORKDIR /mydata
In inference.py, we can decide for example to save an output.csv file with the X_test data in it:
#!/usr/bin/python3
# inference.py
# Xavier Vasques 13/04/2021
import platform; print(platform.platform())
import sys; print("Python", sys.version)
import os
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neural_network import MLPClassifier
import pandas as pd
from joblib import load
from sklearn import preprocessing
def inference():
dirpath = os.getcwd()
print("dirpath = ", dirpath, "\n")
output_path = os.path.join(dirpath,'output.csv')
print(output_path,"\n")
# Load and normalize the new data (test.csv) as in the inference.py listing above (X_test, y_test)
# Run model
clf_lda = load('Inference_lda.joblib')
print("LDA score and classification:")
print(clf_lda.score(X_test, y_test))
print(clf_lda.predict(X_test))
# Run model
clf_nn = load('Inference_NN.joblib')
print("NN score and classification:")
print(clf_nn.score(X_test, y_test))
print(clf_nn.predict(X_test))
#X_test.to_csv(output_path)
print(output_path)
pd.DataFrame(X_test).to_csv(output_path)
if __name__ == '__main__':
inference()
When we build and run the code above, we should be able to see the output.csv file in /mydata:
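The build and run commands are similar to the following sketch, where the host path bound to /mydata is illustrative:
docker build -t my-docker-image .
docker run -v /Users/Xavi/Desktop/code/data:/mydata my-docker-image python3 inference.py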
We can also add the VOLUME instruction in the Dockerfile, resulting in an image that will create a new mount point:
FROM jupyter/scipy-notebook
VOLUME /Users/Xavi/Desktop/code/data
With the name that we specify, the VOLUME instruction creates a mount point that is tagged as holding an externally
mounted volume from the native host or other containers that hold the data we want to process.
For future development, it could be necessary to set environment variables from the beginning, only once at build time, to persist the trained model and perhaps add additional data or metadata to a specific location. The advantage of setting environment variables is that we avoid hard-coding the necessary paths all over our code and can better share our work with others based on an agreed-upon directory structure.
Let us take another example, with a new Dockerfile:
FROM jupyter/scipy-notebook
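The rest of this Dockerfile sets the environment variables read by the scripts below; the following continuation is a sketch, and the variable values are illustrative:
RUN pip install joblib
ENV MODEL_DIR=/home/jovyan/model
ENV MODEL_FILE_LDA=clf_lda.joblib
ENV MODEL_FILE_NN=clf_nn.joblib
COPY train.py ./train.py
COPY inference.py ./inference.py
RUN mkdir -p $MODEL_DIR && python3 train.py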
#!/usr/bin/python3
# train.py
# Xavier Vasques 13/04/2021
import os
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neural_network import MLPClassifier
import pandas as pd
from joblib import dump
from sklearn import preprocessing
def train():
MODEL_DIR = os.environ["MODEL_DIR"]
MODEL_FILE_LDA = os.environ["MODEL_FILE_LDA"]
MODEL_FILE_NN = os.environ["MODEL_FILE_NN"]
MODEL_PATH_LDA = os.path.join(MODEL_DIR, MODEL_FILE_LDA)
MODEL_PATH_NN = os.path.join(MODEL_DIR, MODEL_FILE_NN)
# Data loading and normalization as in the earlier train.py listing (X_train, y_train)
# Models training
clf_lda = LinearDiscriminantAnalysis()
clf_lda.fit(X_train, y_train)
clf_NN = MLPClassifier(max_iter=1000)
clf_NN.fit(X_train, y_train)
# Save the models to the paths defined by the environment variables
dump(clf_lda, MODEL_PATH_LDA)
dump(clf_NN, MODEL_PATH_NN)
if __name__ == '__main__':
train()
#!/usr/bin/python3
# inference.py
# Xavier Vasques 13/04/2021
import os
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neural_network import MLPClassifier
import pandas as pd
from joblib import load
from sklearn import preprocessing
def inference():
MODEL_DIR = os.environ["MODEL_DIR"]
MODEL_FILE_LDA = os.environ["MODEL_FILE_LDA"]
MODEL_FILE_NN = os.environ["MODEL_FILE_NN"]
MODEL_PATH_LDA = os.path.join(MODEL_DIR, MODEL_FILE_LDA)
MODEL_PATH_NN = os.path.join(MODEL_DIR, MODEL_FILE_NN)
# Load and normalize the new data (test.csv) as in the earlier inference.py listing (X_test, y_test)
# Run model
print(MODEL_PATH_LDA)
clf_lda = load(MODEL_PATH_LDA)
print("LDA score and classification:")
print(clf_lda.score(X_test, y_test))
print(clf_lda.predict(X_test))
# Run model
clf_nn = load(MODEL_PATH_NN)
print("NN score and classification:")
print(clf_nn.score(X_test, y_test))
print(clf_nn.predict(X_test))
if __name__ == '__main__':
inference()
The goal is to produce fast and easy steps to build a Docker container with a simple machine learning model. Building the image is as simple as executing docker build -t my-docker-image . (the final dot is the build context).
From this step, we can begin the deployment of our models, which will be much simpler and remove the fear of publishing
and scaling the machine learning model. The next step is to produce a workflow with a continuous integration/continuous
delivery (CI/CD) tool such as Jenkins. With this approach, it will be possible to build and serve a docker container anywhere
and expose a REST API so that external stakeholders can use it. If we are training a deep learning model with high computational needs, we can move the containers to high-performance computing servers or to any platform of choice, such as on-premises, private, or public cloud. The idea is that we can not only scale our model but also create resilient deployments, as we can scale the containers across regions or availability zones.
I hope that the great simplicity and flexibility that containers provide are clear. By containerizing a machine or deep
learning application, we can make it visible to the world. The next step is to deploy it in the cloud and expose it. At certain
times, we might need to orchestrate, monitor, and scale the containers to serve millions of users with the help of
technologies such as Red Hat OpenShift, a Kubernetes distribution.
6.2 Machine Learning Prediction in Real Time Using Docker and Python
REST APIs with Flask
The idea of this section is to perform a rapid and easy build of a Docker container to perform online inference with trained
machine learning models using Python APIs with Flask.
Batch inference is excellent when we have time to compute our predictions. Let us imagine we need real-time predictions. In this case, batch inference is not suitable; we need online inference. Many applications, such as autonomous vehicles, fraud detection, high-frequency trading, applications based on localization data, object recognition and tracking, or brain–computer interfaces, would not work or would not be very useful without online predictions. Sometimes, the prediction needs to be provided within milliseconds.
To understand this concept, we will implement online inferences (LDA and multilayer perceptron neural network models) with Docker and Flask-RESTful, using the following files:
• Dockerfile
• train.py
• api.py
• requirements.txt
• train.csv
• test.json
The file train.py is a Python script that ingests and normalizes electroencephalography (EEG) data and trains two models
to classify the data. The Dockerfile will be used to build our Docker image, requirements.txt (flask, flask-restful, joblib) is
for the Python dependencies, and api.py is the script that will be called to perform the online inference using REST APIs.
The file train.csv contains the data used to train our models, and test.json is a file containing new EEG data that will be
used with our inference models. All files can be found on GitHub.
In api.py, we will create three APIs:
• API 1: We will give a row number to the API, which will extract the data from the selected row and print it.
• API 2: We will give a row number to the API, which will extract the selected row, inject the new data into the models, and retrieve the classification prediction (the # Letter variable in the data).
• API 3: We will ask the API to take all the data in the test.json file and instantly print the classification score of the models.
#!/usr/bin/python3
# api.py (excerpts; a full listing of a very similar api.py appears in Section 6.7)

# We now need the json library so we can load and export json data
import json
import os
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neural_network import MLPClassifier
import pandas as pd
from joblib import load
from sklearn import preprocessing
from flask import Flask

# Environment variables (set in the Dockerfile) and Flask application
MODEL_DIR = os.environ["MODEL_DIR"]
MODEL_PATH_LDA = os.path.join(MODEL_DIR, os.environ["MODEL_FILE_LDA"])
MODEL_PATH_NN = os.path.join(MODEL_DIR, os.environ["MODEL_FILE_NN"])
app = Flask(__name__)

# API 2 (excerpt): predict the letter of a requested row (X_test) with both models
clf_lda = load(MODEL_PATH_LDA)
prediction_lda = clf_lda.predict(X_test)
clf_nn = load(MODEL_PATH_NN)
prediction_nn = clf_nn.predict(X_test)

# API 3 (excerpt): score both models on all the data in test.json
data = pd.read_json('./test.json')
data_test = data.transpose()
y_test = data_test['# Letter'].values
X_test = data_test.drop(data_test.loc[:, 'Line':'# Letter'].columns, axis=1)
clf_lda = load(MODEL_PATH_LDA)
score_lda = clf_lda.score(X_test, y_test)
clf_nn = load(MODEL_PATH_NN)
score_nn = clf_nn.score(X_test, y_test)

if __name__ == "__main__":
    app.run(debug=True, host='0.0.0.0')
The first step, after importing dependencies including the open-source web microframework Flask, is to set the environ-
ment variables that are written in the Dockerfile. We also need to load our linear discriminant analysis and multilayer per-
ceptron neural network serialized models. We create our Flask application by writing app = Flask(__name__). Then, we
create our three Flask routes so that we can serve HTTP traffic on that route:
• http://0.0.0.0:5000/line/250: Obtain data from test.json and return the requested row defined by the variable Line (in this example, we want to extract the data of row number 250).
• http://0.0.0.0:5000/prediction/51: Return classification predictions from models trained by both LDA and NN by injecting the requested data (in this example, we want to inject the data of row number 51).
• http://0.0.0.0:5000/score: Return the classification score for both the neural network and LDA inference models on all the available data (test.json).
The Flask routes allow us to request what we need from the API by adding the name of our procedure (/line/<Line>, /prediction/<int:Line>, /score) to the URL (http://0.0.0.0:5000). Whatever data we add, api.py will always return the
output we request.
#!/usr/bin/python3
# train.py
# Xavier Vasques 13/04/2021
import os
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neural_network import MLPClassifier
import pandas as pd
from joblib import dump
from sklearn import preprocessing
def train():
MODEL_DIR = os.environ["MODEL_DIR"]
MODEL_FILE_LDA = os.environ["MODEL_FILE_LDA"]
MODEL_FILE_NN = os.environ["MODEL_FILE_NN"]
MODEL_PATH_LDA = os.path.join(MODEL_DIR, MODEL_FILE_LDA)
MODEL_PATH_NN = os.path.join(MODEL_DIR, MODEL_FILE_NN)
# Data loading, normalization, and model fitting as in the earlier train.py listings
# (producing clf_lda and clf_NN)
# Serialize the models
dump(clf_lda, MODEL_PATH_LDA)
dump(clf_NN, MODEL_PATH_NN)
if __name__ == '__main__':
train()
FROM jupyter/scipy-notebook
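The rest of this Dockerfile is a sketch consistent with the description above (requirements.txt containing flask, flask-restful, and joblib; the environment variable values are illustrative):
COPY requirements.txt ./requirements.txt
RUN pip install -r requirements.txt
ENV MODEL_DIR=/home/jovyan/model
ENV MODEL_FILE_LDA=clf_lda.joblib
ENV MODEL_FILE_NN=clf_nn.joblib
COPY train.py ./train.py
COPY api.py ./api.py
COPY train.csv ./train.csv
COPY test.json ./test.json
RUN mkdir -p $MODEL_DIR && python3 train.py
We then build the image, tag it my-api, and start the API:
docker build -t my-api .
docker run -it -p 5000:5000 my-api python3 api.py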
The -p flag exposes port 5000 in the container to port 5000 on our host machine. The -it flag allows us to see the logs from
the container, and we run python3 api.py in the my-api image.
The output is the following:
We are running on http://0.0.0.0:5000/, and we can now use our web browser or the curl command to issue a request to the IP address.
If we type
curl http://0.0.0.0:5000/line/232
we will get row number 232 extracted from our data (test.json).
If we then type
curl http://0.0.0.0:5000/prediction/232
the output means that the LDA model has classified the provided data (row 232) as letter 21 (U), while the multilayer perceptron neural network has classified the data as letter 8 (H). The two models do not agree.
If we type
curl http://0.0.0.0:5000/score
we obtain the classification scores of both models on all the available data. As we can see, we should trust the multilayer perceptron neural network more, with its accuracy score of 0.59, even though the score is not so high. There is some work to do to improve the accuracy!
I hope the simplicity of containerizing machine learning and deep learning (ML/DL) applications using Docker and Flask
to perform online inference is clear. This is an essential step when we want to put our models into production. Of course, this example is a simplified view, as we need to take into account many more aspects such as networking, security, monitoring, infrastructure, and orchestration, or add a database to store the data instead of using a .json file.
6.3 From DevOps to MLOPS: Integrate Machine Learning Models Using Jenkins
and Docker
How many AI models have been put into production in enterprises? With investment in data science teams and technol-
ogies, the number of AI projects has increased significantly and with it the number of missed opportunities to put them into
production and assess their real business value. One of the solutions is MLOPS, which delivers the capabilities to bring data
science and information technology (IT) operations together to deploy, monitor, and manage ML/DL models in production.
Continuous integration (CI) and continuous delivery (CD), known as the CI/CD pipeline, embody a culture with agile oper-
ating principles and practices for DevOps teams that allows software development teams to change code more frequently
and reliably or data scientists to continuously test the models for accuracy. CI/CD is a way to focus on business requirements
such as improved model accuracy, automated deployment steps, or code quality. Continuous integration is a set of practices
that drive development teams to continuously implement small changes and check in code to version-control repositories.
Today, data scientists and IT operations have different platforms at their disposal (on-premises, private and public cloud,
multi-cloud, and so on) and tools that need to be addressed by an automatic integration and validation mechanism to allow
building, packaging, and testing of applications with agility. Continuous delivery steps in when continuous integration ends
by automating the delivery of applications to selected platforms.
MLOPS = ML + DEV + OPS
The objective of this section is to integrate machine learning models with DevOps using Jenkins and Docker. There are
many advantages to using Jenkins and Docker for ML/DL. For example, when we train a machine learning model, it is
necessary to continuously test the models for accuracy. This task can be fully automated using Jenkins. When we work
on a data science project, we usually spend some time increasing model accuracy and then, when we are satisfied, we deploy
the application to production, serving it as an API. Let us say our model accuracy is 85%. After a few days or weeks, we decide
to tune some hyperparameters and add some more data in order to improve the model accuracy. Then, we plan to deploy it
in production and to do this, we need to spend some effort to build, test, and deploy the model again, which can create
considerable work depending on the context and environments. This is where the open-source automation server, Jenkins,
comes in. Jenkins provides a continuous integration and continuous delivery (CI/CD) system with hundreds of plug-ins to
build, deploy, and automate software projects. There are several advantages of using Jenkins. It is easy to install and
configure, it has an important community, it contains hundreds of plug-ins, and it can distribute work across different
environments. Jenkins has one objective: to spend less time on deployment and more time on code quality. Jenkins allows
us to create Jobs, which are the nucleus of the build process in Jenkins. For example, we can create Jobs to test our data
science project with different tasks. Jenkins also offers a suite of plug-ins, Jenkins Pipeline, that supports CI/CD. Pipelines can be either declarative or scripted.
The declarative pipeline is a more recent feature that supports the pipeline-as-code concept. It provides richer syntactical features than the scripted pipeline syntax and makes pipeline code easier to write and read. With the scripted approach, the pipeline is written directly in the Jenkins user interface instance instead of in a file.
In this section, we will see how to integrate a machine learning model (linear discriminant analysis and multilayer per-
ceptron neural network) trained on EEG data using Jenkins and Docker.
To learn these concepts, let us consider the following files:
• Dockerfile
• train-lda.py
• train-nn.py
• train-auto-nn.py
• requirements.txt
• train.csv
• test.csv
The train-lda.py and train-nn.py files are Python scripts that ingest and normalize EEG data, train two models to clas-
sify the data, and test the model. The Dockerfile will be used to build our Docker image, and requirements.txt (joblib) is for
the Python dependencies. The file train-auto-nn.py is a Python script that tweaks the neural network model with different
parameters. The file train.csv contains the data used to train our models, and test.csv is a file containing new EEG data that
will be used with our inference models.
All files can be found on GitHub at https://github.com/xaviervasques/Jenkins.
Jenkins requires Java. We can install the Open Java Development Kit (OpenJDK). We can see all the needed information
to install Jenkins here: https://www.jenkins.io/doc/book/installing.
Jenkins can be installed on many distributions (Linux, macOS, Windows) and deployed on a private or public cloud such
as IBM Cloud or others. We can use different commands to start some Jenkins services such as the following:
Register the Jenkins service:
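On a systemd-based distribution, the service commands are presumably similar to:
sudo systemctl daemon-reload
sudo systemctl enable jenkins
sudo systemctl start jenkins
sudo systemctl status jenkins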
If everything has been set up correctly, we should see an output similar to this:
To launch Jenkins, we obtain the IP address of our server by typing hostname -I and launch our browser by entering our IP
and port: 192.168.1.107:8080.
We should see something similar to the following:
We copy the password and paste it into the Administrator password text box and click Continue.
Then, we follow some simple steps to configure the environment.
• Scenario 1: We will clone a GitHub repository automatically when someone updates the machine learning code or pro-
vides additional data. Jenkins will then automatically start the training of a model and provide the classification accuracy,
checking whether the accuracy is less than 80%.
• Scenario 2: We will perform the same task as in scenario 1 and add some additional tasks. We will automatically start the
training of a multilayer perceptron neural network model, provide the classification accuracy score, and check whether it
is less than 80%. If it is, we will run train-auto-nn.py, which will look for the best hyperparameters of our model and print
the new accuracy and the best hyperparameters.
Scenario 1
We will first create a container image using a Dockerfile. The previous sections describe how to do this (Quick Install and First Use of Docker, and Build and Run a Docker Container for Your Machine Learning Model), as does https://towardsdatascience.com/machine-learning-prediction-in-real-time-using-docker-and-python-rest-apis-with-flask-4235aa2395eb.
Then, we will use the build pipeline in Jenkins to create a job chain. We will use a simple model, linear discriminant
analysis coded with scikit-learn, which we will train with EEG data (train.csv). In our first scenario, we want to design
a Jenkins process in which each job will perform different tasks:
• Job #1: Pull the GitHub repository automatically when we update our code in GitHub.
• Job #2: Automatically start the machine learning application, train the model, and provide the prediction accuracy.
Check whether the model accuracy is less than 80%.
On Linux, Jenkins runs under a dedicated user called jenkins. For the jenkins user to use the sudo command, we might want to tell the OS not to ask for a password while executing commands. To do that, we can type a command such as sudo visudo. This command opens the sudoers file in edit mode, and we can add or modify the file, for example with a line such as jenkins ALL=(ALL) NOPASSWD: ALL.
An alternative, perhaps safer, way is to create a file within the /etc/sudoers.d directory, as all files included in the directory
will be automatically processed, avoiding the modification of the sudoers file and preventing any conflicts or errors during
an upgrade. The only thing we need to do is to include this command at the bottom of the sudoers file:
#includedir /etc/sudoers.d
To create a new file in /etc/sudoers.d with the correct permissions, we use the following command:
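A common way to do this is the following (the file name jenkins and the rule are illustrative):
echo "jenkins ALL=(ALL) NOPASSWD: ALL" | sudo tee /etc/sudoers.d/jenkins
sudo chmod 0440 /etc/sudoers.d/jenkins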
Job #1: Pull the GitHub repository automatically when we modify our ML code in GitHub.
We click on Create a job and name it download or whatever name we choose:
In Source Code Management, we select Git and insert our repository URL and our credentials:
We can click on "?" to get help. As an example, the schedule H/10 * * * * means download the code from GitHub every 10 minutes. This is not really useful for our example, so we can leave the Schedule box empty. If we leave it empty, the job will only run in response to SCM changes, when triggered by a post-commit hook.
Then, we click on the “Add build step” drop-down and select “Execute shell.” We type the following command to copy the
repository on GitHub to a specific path previously created:
sudo -S cp * /home/xavi/Public/Code/Kubernetes/Jenkins/code
Following the same procedure, we will create a new job. We need to go to Build Triggers, click on "Build after other projects are built," and type the name of Job #1 (download in our case):
Here, we check whether my-docker-lda is already built, and we then run our container and save the accuracy of our LDA model in a result.txt file. The next step is to check whether the accuracy of the model is less than 80% and provide the output "yes" if this is the case or "no" otherwise. We can also send an email to provide the information: https://plugins.jenkins.io/email-ext.
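The "Execute shell" step for this job might look like the following sketch, assuming the training script prints the accuracy as its last output line (the image name and path follow the text above):
sudo docker build -t my-docker-lda /home/xavi/Public/Code/Kubernetes/Jenkins/code
sudo docker run my-docker-lda > result.txt
accuracy=$(tail -n 1 result.txt)
if [ "$(echo "$accuracy < 0.80" | bc -l)" -eq 1 ]; then echo "yes"; else echo "no"; fi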
To see the outputs of the job, we simply go to Dashboard, Last Success column, select the job, and go to Console Output.
Scenario 2
Let us keep Job #1 and Job #2 and create a new job.
Job #3: Automatically start the neural network training, provide the prediction accuracy, and check whether accuracy is
less than 80%. If so, run a docker container to perform autoML.
We should see in the Console Output the selected parameters and new accuracy.
Jenkins mainly automates the building, testing, and deployment of our data science code.
The next steps would be to consider using Ansible with Jenkins, as Ansible could play an important role in a CI/CD pipe-
line. Ansible will perform the deployment of the application, and we would not need to worry about either how to deploy the
application or whether the environment has been properly established.
6.4 Machine Learning with Docker and Kubernetes: Install a Cluster from Scratch
Kubernetes, the open-source container orchestration platform, is certainly one of the most important tools for scaling
ML/DL efforts. To understand the utility of Kubernetes for data scientists, we can consider all the applications we have
developed and containerized. How will we coordinate and schedule all these containers? How can we upgrade our machine
learning models without interruptions of service? How do we scale the models and make them available to users over the
internet? What happens if our model is used by many more people than we had expected? If we had not thought about the
architecture before, we would need to increase the computing resources and certainly manually create new instances and
redeploy the application. Kubernetes schedules, automates, and manages tasks of container-based architectures.
Kubernetes deploys containers, updates them, and provides service discovery, monitoring, storage provisioning, load
balancing, and more.
If we search for “Kubernetes” on the internet, we will see articles comparing Docker and Kubernetes; this is like compar-
ing an apple and an apple pie. The first thing to state is that Kubernetes is designed to run on a cluster, while Docker runs on
a single node. Kubernetes and Docker are complementary in creating, deploying, and scaling containerized applications.
There is also a comparison to be made between Kubernetes and Docker Swarm, which is a tool for clustering and scheduling
Docker containers. Kubernetes has several options that provide important advantages, such as high-availability policies, autoscaling capabilities, and the possibility of managing complex deployments of hundreds of thousands of containers running on public, hybrid, multi-cloud, or on-premises environments.
All the files used in this chapter can be found on GitHub at https://github.com/xaviervasques/kubernetes.git.
Figure 6.1 Typical Kubernetes architecture: a master node (kubectl CLI, user interface and dashboard, API server, scheduler, controller, and etcd) and worker nodes, each running a container engine, kubelet, and kube-proxy and hosting Pods with one or more containers.
Red Hat/CentOS:
# The repository definition below is reconstructed from the standard Kubernetes yum instructions
cat <<EOF > /etc/yum.repos.d/kubernetes.repo
[kubernetes]
name=Kubernetes
baseurl=https://packages.cloud.google.com/yum/repos/kubernetes-el7-x86_64
enabled=1
gpgcheck=1
repo_gpgcheck=1
gpgkey=https://packages.cloud.google.com/yum/doc/yum-key.gpg https://packages.cloud.google.com/yum/doc/rpm-package-key.gpg
EOF
yum install -y kubectl
Red Hat/CentOS:
For our project, we will create or edit a specific Vagrantfile to build our own environments as follows:
Vagrant.configure("2") do |config|
  config.vm.define "kubmaster" do |kub|
    kub.vm.box = "bento/ubuntu-20.04"
    kub.vm.hostname = 'kubmaster'
    kub.vm.provision "docker"
    config.vm.box_url = "bento/ubuntu-20.04"
    kub.vm.network :private_network, ip: "192.168.56.101"
    # (memory and CPU settings, and the kubnode1/kubnode2 definitions, are elided here)
  end
end
As can be read in the Vagrantfile, we are creating a master node that we name “kubmaster” with Ubuntu version 20.04,
two GB of memory and two CPUs, one IP address (192.168.56.101), and Docker. Then we create two nodes (kubnode1 and
kubnode2) with the same configuration as the master node.
Once edited, we change to root and type the following in the terminal (where the Vagrantfile is):
vagrant up
This command will create and configure guest machines according to our edited Vagrantfile. When the process is finished,
we can connect to the guest machines by using the following commands in a terminal:
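With the machine names defined in the Vagrantfile, these commands are presumably:
vagrant ssh kubmaster
vagrant ssh kubnode1
vagrant ssh kubnode2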
We must deactivate the swap on the master node (kubmaster) and each node (kubnode1 and kubnode2). We connect to
each guest machine to perform the following commands:
swapoff -a
vim /etc/fstab
In the fstab file, we also need to comment out the swap line:
Let us also install some packages on each machine (kubmaster, kubnode1, kubnode2) such as curl and apt-trans-
port-https:
We can now perform a curl to get the gpg key that will allow us to use the Kubernetes binaries kubectl, kubeadm, and
kubelet:
We add access to the Google repository (http://apt.kubernetes.io), which will allow us to download and install the
binaries:
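The corresponding commands presumably follow the standard installation steps for the apt.kubernetes.io repository, similar to:
apt-get update && apt-get install -y apt-transport-https curl
curl -s https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -
echo "deb https://apt.kubernetes.io/ kubernetes-xenial main" > /etc/apt/sources.list.d/kubernetes.list
apt-get update && apt-get install -y kubelet kubeadm kubectl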
192.168.56.101 is the IP address of the master node we defined previously, and 10.244.0.0/16 is a mask for the Kubernetes
internal network defining the range that Kubernetes will use to assign IP addresses within its network.
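The initialization command being described is presumably:
kubeadm init --apiserver-advertise-address=192.168.56.101 --pod-network-cidr=10.244.0.0/16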
We will see the following output with a token that has been generated to join our different nodes:
As can be read in the output, to start using our cluster, we need to create the configuration file to work with kubectl:
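These are the same commands shown later in this chapter:
mkdir -p $HOME/.kube
cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
chown $(id -u):$(id -g) $HOME/.kube/config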
To establish the internal network, we need to provide a network between nodes in the cluster. We will use Flannel, which
is a very simple way to configure a layer 3 network fabric designed for Kubernetes. Different solutions exist, such as Wea-
veNet, Contiv, Cilium, and others. Flannel runs a binary agent called flanneld on each host. Flannel is also responsible for
allocating a subnet lease to each host out of a larger, preconfigured address space.
We add Pods to allow management of the internal network (command to be launched in all nodes):
sysctl net.bridge.bridge-nf-call-iptables=1
We then install our Flannel network using a configuration file (kube-flannel.yml, available online) by typing the follow-
ing command in the master node:
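The command is presumably:
kubectl apply -f kube-flannel.yml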
We can check the status of the Pods in the master node (Flannel network, kube-scheduler, kube-apiserver,
kube-controller-manager, kube-proxy, pods managing internal DNS, a Pod that stores configurations with etcd, etc.):
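For example:
kubectl get pods --all-namespaces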
It is sometimes necessary to modify our configuration of flannel by editing the network (from 10.244.0.0/16 to 10.10.0.0/16).
We can enter the following for this purpose:
If we type the command below in the master node, we can see that our master is ready:
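The command is presumably:
kubectl get nodes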
It is now time to join the nodes to the master. For this, we copy the previously generated token and type the following
command in the two nodes (kubnode1 and kubnode2):
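The join command uses the token and certificate hash printed by kubeadm init; with placeholders, it looks like:
kubeadm join 192.168.56.101:6443 --token <token> --discovery-token-ca-cert-hash sha256:<hash>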
We can return to the master node and type the following commands to check the status:
We should see the nodes with the status “Ready.” We can also execute docker ps to see all our launched containers (cor-
edns, flannel, etc.) in both the master node and the others.
One last comment pertains to cluster access. If we type in the master node:
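The command here is presumably:
kubectl get nodes -o wide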
We can see that our kubmaster and nodes have the same internal IP:
vim /etc/hosts
The /etc/hosts file also needs to be modified in each node (kubnode1 and kubnode2):
Then, we can delete each flannel by typing the following in the master node:
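A sketch of the command, assuming the Flannel Pods carry the app=flannel label in the kube-system namespace:
kubectl delete pod -n kube-system -l app=flannel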
The magic of Kubernetes is that it reconfigures itself without losing the service, and new Flannel Pods are created:
Now that we know how to install a Kubernetes cluster, we can create Kubernetes Jobs that will create Pods (and hence containers), allowing us, for instance, to train our machine learning models, serialize them, load models into memory, and perform inferences.
6.5 Machine Learning with Docker and Kubernetes: Training Models
We enable traffic between the VM and the host machine. We change to root and make sure to turn off the swap and
comment out the reference swap in /etc/fstab:
swapoff -a
vim /etc/fstab
echo \
"deb [arch=amd64 signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu \
$(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
After changing to root (sudo -s), we perform a curl to obtain the gpg key that will allow us to use the Kubernetes binaries
kubectl, kubeadm, and kubelet:
We add access to the Google repository (http://apt.kubernetes.io) that will allow us to download and install the binaries:
All these steps must be performed in all nodes of our cluster (master and nodes).
192.168.1.55 is the IP address of the master node (kubmaster) we defined previously, and 10.244.0.0/16 is a mask for the
Kubernetes internal network defining the range that Kubernetes will use to assign IP addresses within its network.
We receive the following output:
As can be read in the output, to start using our cluster, we need to create the configuration file to work with kubectl (as a
regular user):
mkdir -p $HOME/.kube
cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
chown $(id -u):$(id -g) $HOME/.kube/config
To put the internal network in place, we need to provide a network between nodes in the cluster. For this, we will use
Flannel, which is a very simple way to configure a layer 3 network fabric designed for Kubernetes. We need to provide the
possibility for managing the internal network (a command to be launched in all nodes):
sysctl net.bridge.bridge-nf-call-iptables=1
We then install our Flannel network using a configuration file (kube-flannel.yml), available online, by typing the follow-
ing command in the master node:
We can check the status of the Pods in the master node (Flannel network, kube-scheduler, kube-apiserver,
kube-controller-manager, kube-proxy, Pods managing internal DNS, a Pod that stores configurations with etcd, etc.):
If everything is running, it is time to join the nodes to the master. We copy the previously generated token and type the
following command in the nodes (kubenode1):
We return to the master node and type the following command to check the status:
From typing the command below in the master node, we can see that our master and kubnode1 are ready:
FROM jupyter/scipy-notebook
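The remaining lines of this Dockerfile are similar to the following sketch, based on the description below (the variable values are illustrative):
RUN pip install joblib paramiko
ENV MODEL_DIR=/home/jovyan/model
ENV MODEL_FILE_LDA=clf_lda.joblib
ENV MODEL_FILE_NN=clf_nn.joblib
ENV METADATA_FILE=metadata.json
COPY train.py ./train.py
COPY id_rsa ./id_rsa
RUN mkdir -p $MODEL_DIR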
As can be seen, we have set our environment variables and have installed joblib, which allows serialization and deser-
ialization of our trained models, and paramiko, which is a Python implementation of the SSHv2 protocol (require-
ments.txt).
We have set environment variables from the beginning to persist the trained model and add data and metadata. We have
copied the train.py file into the image. We have also copied the id_rsa file, generated using ssh-keygen, to be able to connect
to a remote server through SSH.
We must set up an RSA key authentication to establish the connection between the cluster and the external server. We
need to generate a public (id_rsa.pub) and a private key (id_rsa) that we can use to authenticate:
ssh-keygen -t rsa
We copy and paste the content of id_rsa.pub into ~/.ssh/authorized_keys (on the target system):
vim /home/xavi/.ssh/id_rsa.pub
If we wish, we can copy the id_rsa file into our current directory and modify the permissions with chmod:
sudo cp /home/xavi/.ssh/id_rsa .
sudo chmod a+r id_rsa
We can then remove the id_rsa file with the docker run command.
For this step, we could use various methodologies such as writing a Dockerfile similar to the following by ensuring we
remove the id_rsa file at the end of the build process:
ARG SSH_PRIVATE_KEY
RUN mkdir /root/.ssh/
RUN echo "${SSH_PRIVATE_KEY}" > /root/.ssh/id_rsa
# [...]
RUN rm /root/.ssh/id_rsa
We can build the Docker image with code similar to the following:
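With the SSH_PRIVATE_KEY build argument of the Dockerfile above, the build command might look like this (the image name matches the tag pushed later):
docker build --build-arg SSH_PRIVATE_KEY="$(cat ~/.ssh/id_rsa)" -t kubernetes-models .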
It is not the topic of this chapter, but we need to be careful not to leave traces inside our Docker image. Even if we delete a file, it can still be viewed in one of the layers of the image we push. We can use the --squash parameter to reduce multiple layers between the origin and the latest stage to one; it can also reduce the size of an image by removing files that are no longer present. We can also work with multi-stage builds in a single Dockerfile in which we build multiple Docker images; only the last one will persist and leave traces:
# This is intermediate
FROM ubuntu as intermediate
# [...]
# Final image
FROM ubuntu
# [...]
For security reasons, we can also use Kubernetes secret objects to store and manage sensitive information such as pass-
words, OAuth tokens, and SSH keys. Putting this information in a secret object is safer and more flexible than putting it in
the definition of a Pod or in a container image: https://kubernetes.io/docs/concepts/configuration/secret/.
Now that we have our Dockerfile, let us look at other files. In train.py, we import the necessary libraries, read the envi-
ronment variables set in our Docker image for persisting models, load training data (train.csv), which is stored on GitHub
(https://raw.githubusercontent.com/xaviervasques/kubernetes/main/train.csv), train two models (linear discriminant
analysis and a multilayer perceptron neural network), serialize them, perform a cross-validation, and upload the trained
models and cross-validation results to a remote server (192.168.1.11) in a specified directory (/home/xavi/output/). We
could also define the URL, specified directory, and the IP address as environment variables:
#!/usr/bin/python3
# train.py
# Xavier Vasques 16/05/2021
import os
import json
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score
import pandas as pd
from joblib import dump, load
from sklearn import preprocessing
import paramiko
def train():
MODEL_DIR = os.environ["MODEL_DIR"]
MODEL_FILE_LDA = os.environ["MODEL_FILE_LDA"]
MODEL_FILE_NN = os.environ["MODEL_FILE_NN"]
METADATA_FILE = os.environ["METADATA_FILE"]
MODEL_PATH_LDA = os.path.join(MODEL_DIR, MODEL_FILE_LDA)
MODEL_PATH_NN = os.path.join(MODEL_DIR, MODEL_FILE_NN)
METADATA_PATH = os.path.join(MODEL_DIR, METADATA_FILE)
# Data loading, normalization, model training, and cross-validation as in the earlier
# train.py listings (producing clf_lda, clf_NN, and the cross-validation metadata file)
# Serialize the models
dump(clf_lda, MODEL_PATH_LDA)
dump(clf_NN, MODEL_PATH_NN)
print("Moving to 192.168.1.11")
client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
k = paramiko.RSAKey.from_private_key_file('id_rsa')
client.connect("192.168.1.11", username="xavi", pkey=k)
sftp = client.open_sftp()
sftp.put(MODEL_PATH_LDA, "/home/xavi/output/"+MODEL_FILE_LDA)
sftp.put(MODEL_PATH_NN,"/home/xavi/output/"+MODEL_FILE_NN)
sftp.put(METADATA_PATH,"/home/xavi/output/"+METADATA_FILE)
sftp.close()
client.close()
if __name__ == '__main__':
train()
We can go to the next step by building the Docker image, running it to test our application locally, tagging it with the name
of an image repository on the Docker Hub registry, and pushing it to the registry, to be ready to use the image in our Kuber-
netes cluster:
To test if our code is functional, we will run a container locally and test our built image:
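For example (the image name matches the tag pushed below):
docker build -t kubernetes-models .
docker run kubernetes-models python3 train.py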
If everything is working, we can push a new image to the repository using the CLI:
docker login
docker tag kubernetes-models:latest xaviervasques/kubernetes-models:latest
docker push xaviervasques/kubernetes-models:latest
apiVersion: batch/v1
kind: Job
metadata:
  name: train-models-job
spec:
  template:
    spec:
      containers:
      - name: train-container
        imagePullPolicy: Always
        image: xaviervasques/kubernetes-models:latest
        # Run the training script and then remove the id_rsa file
        command: ["/bin/bash", "-c", "python3 train.py && rm ./id_rsa"]
      restartPolicy: Never
  backoffLimit: 4
As explained in the Kubernetes documentation, as with all other Kubernetes configurations, a Job needs apiVersion, kind, and metadata fields. A Job also needs a .spec section. apiVersion specifies the version of the Kubernetes API to use, and kind is the type of Kubernetes resource; we can provide a label with metadata. The .spec.template is the only required field of .spec and represents a Pod template: it has the same schema as a Pod, except that it is nested and has no apiVersion or kind. In .spec.template.spec.containers, we provide each container with a name, the image we want to use, and the command we want to run in the container. Here, we want to run train.py and then remove our id_rsa file.
The equivalent docker command would be the following:
Thanks to imagePullPolicy: Always, Kubernetes will pull the image from the registry instead of using a cached image. Finally, we set whether containers should be restarted if they fail (Never or OnFailure) and decide the number of retries before considering a Job as failed (the back-off limit is set to 6 by default).
We are finally ready to get our application running on Kubernetes. We launch the following command:
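The command is presumably kubectl apply with the Job manifest above (the file name here is assumed):
kubectl apply -f train-models-job.yaml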
Reset kubeadm:
kubeadm reset
sudo apt-get purge kubeadm kubectl kubelet kubernetes-cni kube*
sudo apt-get autoremove
sudo rm -rf ~/.kube
The next steps would be to perform batch inference and scoring by loading predictions from our trained models and also to
perform real-time, online inference using REST APIs. In addition, it is essential to explore the way we set the hybrid or
multi-cloud architecture (end to end) to run our models in production in an open and agile environment.
6.6 Machine Learning with Docker and Kubernetes: Batch Inference
In this section, we will implement batch inference from trained models in the Kubernetes cluster we developed and installed in the previous sections.
All the files used in this chapter can be found on GitHub at https://github.com/xaviervasques/kubernetes.git.
To begin, we will need to modify our previous Dockerfile as follows:
FROM jupyter/scipy-notebook
#!/usr/bin/python3
# inference.py
# Xavier Vasques 13/04/2021
import os
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neural_network import MLPClassifier
import pandas as pd
from joblib import load
from sklearn import preprocessing
import paramiko
def inference():
MODEL_DIR = os.environ["MODEL_DIR"]
MODEL_FILE_LDA = os.environ["MODEL_FILE_LDA"]
MODEL_FILE_NN = os.environ["MODEL_FILE_NN"]
MODEL_PATH_LDA = os.path.join(MODEL_DIR, MODEL_FILE_LDA)
MODEL_PATH_NN = os.path.join(MODEL_DIR, MODEL_FILE_NN)
# Load and normalize the new data (test.csv) as in the earlier inference.py listings (X_test, y_test)
# Connect over SSH to the remote server that stores the serialized models (as in train.py)
ssh_client = paramiko.SSHClient()
ssh_client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
ssh_client.connect("192.168.1.11", username="xavi", pkey=paramiko.RSAKey.from_private_key_file('id_rsa'))
print(MODEL_PATH_LDA)
sftp_client = ssh_client.open_sftp()
remote_file = sftp_client.open("/home/xavi/output/"+MODEL_FILE_LDA)
#clf_lda = load(MODEL_PATH_LDA)
clf_lda = load(remote_file)
print("LDA score and classification:")
print(clf_lda.score(X_test, y_test))
print(clf_lda.predict(X_test))
remote_file.close()
print(MODEL_PATH_NN)
sftp_client = ssh_client.open_sftp()
remote_file = sftp_client.open("/home/xavi/output/"+MODEL_FILE_NN)
# Run model
#clf_nn = load(MODEL_PATH_NN)
clf_nn = load(remote_file)
print("NN score and classification:")
print(clf_nn.score(X_test, y_test))
print(clf_nn.predict(X_test))
remote_file.close()
if __name__ == '__main__':
inference()
In inference.py, we import the necessary libraries, read the environment variables set in our Docker image, and read new
data (test.csv) stored at GitHub (https://raw.githubusercontent.com/xaviervasques/kubernetes/main/test.csv) to feed our
serialized models to make predictions and provide an accuracy score. We will download our previously trained models (lin-
ear discriminant analysis and a multilayer perceptron neural network) stored in a specified directory (/home/xavi/output)
from a remote server (192.168.1.11) using SSH.
We can then go to the next step by building the Docker image, running it to test our application locally, tagging it with the
name of an image repository on the Docker Hub registry, and pushing it to the registry to be ready to use the image in our
Kubernetes cluster:
To test if our code is functional, we will run our container locally to test the image:
If everything is working, we can push the new image to a repository using the CLI:
docker login
docker tag kubernetes-inference:latest xaviervasques/kubernetes-inference:latest
docker push xaviervasques/kubernetes-inference:latest
apiVersion: batch/v1
kind: Job
metadata:
  name: inference-job
spec:
  template:
    spec:
      containers:
      - name: inference-container
        imagePullPolicy: Always
        image: xaviervasques/kubernetes-inference:latest
        # Run the batch inference script and then remove the id_rsa file
        command: ["/bin/bash", "-c", "python3 inference.py && rm ./id_rsa"]
      restartPolicy: Never
  backoffLimit: 4
We are finally ready to get our application running on Kubernetes. We launch the following command:
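As before, the command is presumably kubectl apply with the Job manifest above (the file name here is assumed):
kubectl apply -f inference-job.yaml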
Kubernetes serves as an excellent framework with which to deploy models effectively. We can use Kubernetes to deploy
each of our models as independent and lightweight microservices. These microservices can be used for other applications.
The next steps would be to deploy online inferences using REST APIs. We can also work with Kubernetes to schedule our
training and inference processes to run on a recurring schedule.
6.7 Machine Learning Prediction in Real Time Using Docker, Python Rest APIs with
Flask, and Kubernetes: Online Inference
The idea of this section is to create a Docker container to perform online inference with trained machine learning models
using Python APIs with Flask. As an example of this concept, we will implement online inferences (linear discriminant
analysis and multilayer perceptron neural network models) with Docker and Flask-RESTful.
To start, let us consider the following files:
• Dockerfile
• train.py
• api.py
• requirements.txt
• train.csv
• test.json
As before, we will create three APIs:
• API 1: We will give a row number to the API, which will extract the data from the selected row and print it.
• API 2: We will give a row number to the API, which will extract the selected row, inject the new data into the models, and retrieve the classification prediction (the # Letter variable in the data).
• API 3: We will ask the API to take all the data in the test.json file and instantly print the classification score of the models.
#!/usr/bin/python3
# api.py
# Xavier Vasques 13/04/2021

import json
import os
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neural_network import MLPClassifier
import pandas as pd
from joblib import load
from sklearn import preprocessing
from flask import Flask

# Set the environment variables that are written in the Dockerfile
MODEL_DIR = os.environ["MODEL_DIR"]
MODEL_FILE_LDA = os.environ["MODEL_FILE_LDA"]
MODEL_FILE_NN = os.environ["MODEL_FILE_NN"]
MODEL_PATH_LDA = os.path.join(MODEL_DIR, MODEL_FILE_LDA)
MODEL_PATH_NN = os.path.join(MODEL_DIR, MODEL_FILE_NN)

# Create the Flask application
app = Flask(__name__)

# API 1
# Flask route so that we can serve HTTP traffic on that route
@app.route('/line/<Line>')
# Get data from json and return the requested row defined by the variable Line
def line(Line):
    with open('./test.json', 'r') as jsonfile:
        file_data = json.loads(jsonfile.read())
    # We can then find the data for the requested row and send it back as json
    return json.dumps(file_data[Line])

# API 2
# Flask route so that we can serve HTTP traffic on that route
@app.route('/prediction/<int:Line>', methods=['POST', 'GET'])
# Return prediction for both Neural Network and LDA inference models with the requested row as input
def prediction(Line):
    data = pd.read_json('./test.json')
    data_test = data.transpose()
    X = data_test.drop(data_test.loc[:, 'Line':'# Letter'].columns, axis=1)
    X_test = X.iloc[Line, :].values.reshape(1, -1)
    clf_lda = load(MODEL_PATH_LDA)
    prediction_lda = clf_lda.predict(X_test)
    clf_nn = load(MODEL_PATH_NN)
    prediction_nn = clf_nn.predict(X_test)
    # Return both predictions (the structure of the returned message is illustrative)
    return str([prediction_lda[0], prediction_nn[0]])

# API 3
# Flask route so that we can serve HTTP traffic on that route
@app.route('/score', methods=['POST', 'GET'])
# Return classification score for both Neural Network and LDA inference models on the whole dataset provided
def score():
    data = pd.read_json('./test.json')
    data_test = data.transpose()
    y_test = data_test['# Letter'].values
    X_test = data_test.drop(data_test.loc[:, 'Line':'# Letter'].columns, axis=1)
    clf_lda = load(MODEL_PATH_LDA)
    score_lda = clf_lda.score(X_test, y_test)
    clf_nn = load(MODEL_PATH_NN)
    score_nn = clf_nn.score(X_test, y_test)
    # Return both scores (the structure of the returned message is illustrative)
    return str([score_lda, score_nn])

if __name__ == "__main__":
    app.run(debug=True, host='0.0.0.0')
The first step, after importing dependencies including the open-source web microframework Flask, is to set the environ-
ment variables that are written in the Dockerfile. We also need to load our linear discriminant analysis and multilayer per-
ceptron neural network serialized models. We create our Flask application by writing app = Flask(__name__). Then, we
create our three Flask routes so that we can serve HTTP traffic on those routes:
• http://0.0.0.0:5000/line/250: Get data from test.json and return the requested row defined by the variable Line (in this example, we want to extract the data of row number 250).
• http://0.0.0.0:5000/prediction/51: Return a classification prediction from both LDA and neural network-trained models by injecting the requested data (in this example, we want to inject the data of row number 51).
• http://0.0.0.0:5000/score: Return a classification score for both the neural network and the LDA inference models on all the available data (test.json).
The Flask routes allow us to request what we need from the API by adding the name of our procedure (/line/<Line>, /prediction/<int:Line>, /score) to the URL (http://0.0.0.0:5000). Whatever data we add, api.py will always return the
output we request.
#!/usr/bin/python3
# train.py
# Xavier Vasques 13/04/2021
import os
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neural_network import MLPClassifier
import pandas as pd
from joblib import dump
from sklearn import preprocessing
def train():
# Environment variables, data loading, normalization, and model fitting as in the
# earlier train.py listings (producing clf_lda, clf_NN, MODEL_PATH_LDA, and MODEL_PATH_NN)
# Serialize the models
dump(clf_lda, MODEL_PATH_LDA)
dump(clf_NN, MODEL_PATH_NN)
if __name__ == '__main__':
train()
FROM jupyter/scipy-notebook
The -p flag exposes port 5000 in the container to port 5000 on our host machine, and the -it flag allows us to see the logs
from the container; we run python3 api.py in the my-api image.
The output is the following:
It can be seen that we are running on http://172.17.0.2:5000/; we can now use our web browser or the curl command to issue a request to the IP address:
curl http://172.17.0.2:5000/line/23
We will get the row number 23 extracted from our data (test.json).
If we type
curl http://172.17.0.2:5000/prediction/23
the output means that the LDA model has classified the provided data (row 23) as letter 21 (U), while the multilayer perceptron neural network has classified the data as letter 0 (A). The two models do not agree.
If we type
curl http://172.17.0.2:5000/score
we obtain the classification scores of both models on all the available data. As can be seen, we should trust the multilayer perceptron neural network more, with its accuracy score of 0.59, even though the score is not so high. There is some work to do to improve the accuracy!
Now that our application is working properly, we can move to the next step and deploy it in a Kubernetes Cluster. Before
doing that, let us push the image to a repository using the CLI:
docker login
docker tag my-kube-api:latest xaviervasques/my-kube-api:latest
docker push xaviervasques/my-kube-api:latest
swapoff -a
vim /etc/fstab
echo \
"deb [arch=amd64 signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu \
$(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
Then, after changing to root (sudo -s), we perform a curl to obtain the gpg key that will allow us to use the Kubernetes
binaries kubectl, kubeadm, and kubelet:
We add access to the Google repository (http://apt.kubernetes.io), which will allow us to download and install the
binaries:
All these steps must be performed in all nodes of the cluster (master and nodes).
192.168.1.55 is the IP address of the master node (kubmaster) we defined previously, and 10.244.0.0/16 is a mask for the
Kubernetes internal network defining the range that Kubernetes will use to assign IP addresses within its network.
We receive the following output:
As can be read in the output, to start using our cluster, we need to create a configuration file to work with kubectl (as a
regular user):
mkdir -p $HOME/.kube
cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
chown $(id -u):$(id -g) $HOME/.kube/config
To establish the internal network, we need to provide a network between nodes in the cluster. For this task, we will use
Flannel, which is a very simple way to configure a layer 3 network fabric designed for Kubernetes. We need to provide the
possibility of managing the internal network (command to be launched in all nodes):
sysctl net.bridge.bridge-nf-call-iptables=1
We then install our Flannel network using a configuration file (kube-flannel.yml, available online) by typing the follow-
ing command in the master node:
We can check the status of the Pods in the master node (Flannel network, kube-scheduler, kube-apiserver, kube-control-
ler-manager, kube-proxy, Pods managing internal DNS, a Pod that stores configurations with etcd, etc.):
If everything is running, it is time to join the nodes to the master. For this, we copy the previously generated token and
type the following command in the nodes (kubenode1):
We come back to the master node and type the following command to check the status:
If we type the command below in the master node, we can see that our master and kubnode1 are ready:
We then install Kustomize, which we will use to deploy our application:
curl -s https://api.github.com/repos/kubernetes-sigs/kustomize/releases | \
grep browser_download | \
grep linux | \
cut -d '"' -f 4 | \
grep /kustomize/v | \
sort | tail -n 1 | \
xargs curl -O -L && \
tar xzf ./kustomize_v*_linux_amd64.tar.gz && \
mv kustomize /usr/bin/
We then create a folder named “base” in the master node and create the following YAML files inside it:
• namespace.yaml
• deployment.yaml
• service.yaml
• kustomization.yaml
The namespace.yaml file creates a dedicated namespace, called mlops, for our application:
apiVersion: v1
kind: Namespace
metadata:
  name: mlops
The deployment.yaml file will let us manage a set of identical Pods. If we do not use a deployment, we would need to
create, update, and delete many Pods manually. It is also a way to easily autoscale our applications. In our example, we have
decided to create two Pods (replicas), load the Docker image that we had pushed previously, and run our api.py script:
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: my-app
    env: qa
  name: my-app
  namespace: mlops
spec:
  replicas: 2 # Creating two Pods for our app
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
        env: qa
    spec:
      containers:
      - image: xaviervasques/my-kube-api:latest # Docker image name that we pushed to the registry
        name: my-kube-api # Container name
        command: ["python3", "api.py"]
        ports:
        - containerPort: 5000
          protocol: TCP
The service.yaml file will expose our application running on a set of Pods as a network service:
apiVersion: v1
kind: Service
metadata:
  name: my-app
  labels:
    app: my-app
  namespace: mlops
spec:
  type: LoadBalancer
  ports:
  - port: 5000
    targetPort: 5000
  selector:
    app: my-app
Finally, the kustomization.yaml file lists the resources that Kustomize will apply together:
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- namespace.yaml
- deployment.yaml
- service.yaml
To deploy our application, we use this single command in our master node:
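The command is presumably the Kustomize integration built into kubectl (or, equivalently, kustomize build base | kubectl apply -f -):
kubectl apply -k base/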
To list the namespaces and check that the new mlops namespace has been created, we can enter the following:
kubectl get ns
To see the status of the deployment, we can use the following command:
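For example, the following shows the Deployment, Pods, and Service created in the mlops namespace:
kubectl get all -n mlops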
We are now ready to use our deployed model by using curl or a web browser:
curl http://10.97.99.101:5000/line/23
We will get the row number 23 extracted from our data (test.json).
If we type
curl http://10.97.99.101:5000/prediction/23
we will obtain the model's prediction for row 23. If we then type
curl http://10.97.99.101:5000/score
we will see the score of our models on the entire dataset:
6.8 A Machine Learning Application that Deploys to the IBM Cloud Kubernetes Service:
Python, Docker, Kubernetes
We are going to see that, compared to the previous descriptions, it is very easy to create a Kubernetes cluster with IBM Cloud.
The wealth of Kubernetes resources can make it difficult to find the basics. An easy way to simplify Kubernetes development
and make it easy to deploy is to use solutions such as IBM Cloud Kubernetes Services. To create a machine learning appli-
cation that deploys to the IBM Cloud Kubernetes Service, we need an IBM Cloud account (sign up for a free account: https://
cloud.ibm.com/registration), IBM Cloud CLI, Docker CLI, and Kubernetes CLI.
IBM Cloud Kubernetes Service offers two types of clusters:
• A free cluster (one worker pool with a single virtual-shared worker node with 2 cores, 4 GB RAM, and 100 GB SAN).
• A fully customizable standard cluster (virtual-shared, virtual-dedicated, or bare metal) for the heavy lifting.
If we only want to explore, the free cluster is ideal.
In IBM Cloud, with a few clicks, we can automatically create a Kubernetes service. First, we need to connect to our IBM
Cloud Dashboard at https://cloud.ibm.com/dashboard/apps.
We go to IBM Kubernetes Service, click on Create clusters, and type in a name for our cluster. Depending on our account
(paid or free), we can select the appropriate cluster type (in our case we will only create a worker node with 2 vCPUs and 4
GB of RAM). After a few minutes, the cluster is created:
Once the cluster is ready, we can click on our cluster’s name, and we will be directed to a new page with information about
our cluster and worker node:
To connect to our cluster, we can click on our worker node tab to get the public IP address of the cluster:
Done! We can have fast access using the IBM Cloud Shell at https://cloud.ibm.com/docs/containers?topic=containers-cs_cli_install#cloud-shell.
If we want to use our own terminal, we need some prerequisites (if they are not already installed). We need to install the
required CLI tools: IBM Cloud CLI, Kubernetes Service plug-in (ibmcloud ks), and Kubernetes CLI (kubectl).
To install the IBM Cloud CLI, we will type the following in a terminal to install the stand-alone IBM Cloud CLI
(ibmcloud):
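At the time of writing, the documented stand-alone installer for Linux is a one-line script of this form (the documentation page referenced below gives the current command for each platform):
curl -fsSL https://clis.cloud.ibm.com/install/linux | sh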
The above command is for Linux. All necessary commands for various distributions can be found at https://cloud.ibm.com/docs/containers?topic=containers-cs_cli_install.
We log in to the IBM Cloud CLI by entering our IBM Cloud credentials when prompted:
ibmcloud login
If we have a federated ID, we can use ibmcloud login --sso to log in to the IBM Cloud CLI.
Otherwise, we can also connect with an IBM Cloud API key as follows:
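For example, replacing <API_KEY> with our own key:
ibmcloud login --apikey <API_KEY>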
If one is not already available, we can create an IBM Cloud API key. To do this, we need to go to the IBM Cloud console,
then go to Manage > Access (IAM), and select API keys:
We can click create an IBM Cloud API key, add a name and description, and copy or download the API key to a secure
location. We can then log in using the command above.
We can install the IBM Cloud plug-in for the IBM Cloud Kubernetes Service (ibmcloud ks):
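ibmcloud plugin install container-service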
We can also install the IBM Cloud plug-in for the IBM Cloud Container Registry (ibmcloud cr):
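ibmcloud plugin install container-registry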
We can further install the IBM Cloud Kubernetes Service observability plug-in (ibmcloud ob):
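The plug-in name below is the one currently used in the IBM Cloud plug-in repository:
ibmcloud plugin install observe-service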
The Kubernetes CLI is already installed in our environment. If it is not already installed, we can follow the steps at https://
cloud.ibm.com/docs/containers?topic=containers-cs_cli_install.
If we want to list all the clusters in the account, we can input the following:
ibmcloud ks cluster ls
We can check if our cluster is in a healthy state by running the following command:
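For example:
ibmcloud ks cluster get --cluster IBM_Cloud_node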
Here, IBM_Cloud_node is our cluster name; we can also use the ID of the cluster.
To build and deploy our application, our project directory contains the following files:
• Dockerfile
• train.py
• api.py
• requirements.txt
The train.py script, shown here only as a skeleton, trains our model and saves it to disk:
def train():
if __name__ == '__main__':
train()
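A minimal, illustrative version of the train.py script, assuming the Iris dataset, a scikit-learn SVM classifier, and a model persisted with joblib at the location given by the MODEL_DIR and MODEL_FILE environment variables used elsewhere in this chapter (split and model parameters are illustrative), could be the following:
import os
from joblib import dump
from sklearn import svm
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

def train():
    # Location used to persist the trained model (set through environment variables)
    MODEL_DIR = os.environ["MODEL_DIR"]
    MODEL_FILE = os.environ["MODEL_FILE"]
    MODEL_PATH = os.path.join(MODEL_DIR, MODEL_FILE)

    # Load and split the Iris dataset (split parameters are illustrative)
    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Train a support vector machine classifier and persist it with joblib
    clf = svm.SVC(gamma='scale')
    clf.fit(X_train, y_train)
    dump(clf, MODEL_PATH)

if __name__ == '__main__':
    train()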
We also need to build an API that will ingest the data (X_test) and output what we want. In our case, we will only request
the classification score of the model:
#!/usr/bin/python3
# api.py
# Xavier Vasques 03/06/2021

import os
from flask import Flask
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from joblib import load

# Model location, provided through environment variables set in the Dockerfile
MODEL_DIR = os.environ["MODEL_DIR"]
MODEL_FILE = os.environ["MODEL_FILE"]
MODEL_PATH = os.path.join(MODEL_DIR, MODEL_FILE)

# Loading model
print("Loading model from: {}".format(MODEL_PATH))
inference = load(MODEL_PATH)

# Rebuild the Iris test set used to score the model (split parameters assumed to match train.py)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

app = Flask(__name__)

# API
# Flask route so that we can serve HTTP traffic on that route
@app.route('/score', methods=['POST', 'GET'])
def prediction():
    # Return the classification score of the model on the Iris test data
    clf = load(MODEL_PATH)
    score = clf.score(X_test, y_test)
    return "Score: {}\n".format(score)

if __name__ == "__main__":
    app.run(debug=True, host='0.0.0.0')
We are now ready to containerize the Flask application. In our project directory, we have created our Dockerfile with
jupyter/scipy-notebook image as our base image, set our environment variables, and installed joblib and flask; we copy
train.py and api.py files into the image:
FROM jupyter/scipy-notebook
EXPOSE 5000
We want to expose the port (5000) on which the Flask application runs, so we have used EXPOSE.
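Putting these pieces together, a Dockerfile along these lines could look as follows (the environment variable values, the model file name, and the choice to run train.py at build time are illustrative assumptions):
FROM jupyter/scipy-notebook

# Python dependencies used by train.py and api.py
RUN pip install joblib flask

# Environment variables read by train.py and api.py (illustrative values)
ENV MODEL_DIR=/home/jovyan
ENV MODEL_FILE=svm.joblib

COPY train.py ./train.py
COPY api.py ./api.py

# Train the model at build time so that the image ships with a serialized model (assumption)
RUN python3 train.py

EXPOSE 5000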
To verify that our application is running without issue, let us build and run our image locally:
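For example (the image tag is arbitrary here, and 172.17.0.2 below is the default Docker bridge address of the first running container):
docker build -t my-kube-api .
docker run -d my-kube-api python3 api.py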
curl http://172.17.0.2:5000/score
We need to install the Container Registry plug-in locally using the following command:
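If it is not already present:
ibmcloud plugin install container-registry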
ibmcloud login
We log our local Docker daemon into the IBM Cloud Container Registry using the following command:
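ibmcloud cr login
We can then tag and push our image; the registry region and namespace below match the image name used later in deployment.yaml (the registry namespace must already exist):
docker tag my-kube-api:latest de.icr.io/xaviervasques/my-kube-api:latest
docker push de.icr.io/xaviervasques/my-kube-api:latest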
We can verify the status of our image by checking whether it is on our private registry:
ibmcloud cr image-list
ibmcloud ks clusters
Our my-k8s Kubernetes cluster is up and running. We can connect kubectl to the cluster:
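For example, the following downloads the kubeconfig for the cluster and points kubectl at it:
ibmcloud ks cluster config --cluster my-k8s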
We will create a folder named “base” in the master node and create the following YAML files inside it:
• namespace.yaml
• deployment.yaml
• service.yaml
• service_port.yaml
• kustomization.yaml
The namespace.yaml file creates the mlapi namespace:
apiVersion: v1
kind: Namespace
metadata:
name: mlapi
The deployment.yaml will let us manage a set of identical Pods. If we do not use a deployment, we would need to create,
update, and delete many Pods manually. It is also a way to easily autoscale our applications. In our example, we have
decided to create a single Pod (replica), load our Docker image that we pushed previously, and run our api.py script:
apiVersion: apps/v1
kind: Deployment
metadata:
labels:
app: my-app
env: qa
name: my-app
namespace: mlapi
spec:
replicas: 1 # Creating PODs for our app
selector:
matchLabels:
app: my-app
template:
metadata:
labels:
app: my-app
env: qa
spec:
containers:
- image: de.icr.io/xaviervasques/my-kube-api:latest # Docker image name that we uploaded
name: my-kube-api # POD name
command: ["python3", "api.py" ]
ports:
- containerPort: 5000
protocol: TCP
imagePullSecrets:
- name: all-icr-io
The service.yaml file will expose our application running on a set of Pods as a network service:
apiVersion: v1
kind: Service
metadata:
name: my-app
labels:
app: my-app
namespace: mlapi
spec:
type: LoadBalancer
ports:
- port: 5000
targetPort: 5000
selector:
app: my-app
The service_port.yaml file is the following:
apiVersion: v1
kind: Service
metadata:
name: nodeport
spec:
type: NodePort
ports:
- port: 32743
The reason we create the service_port.yaml file is to make our containerized app accessible over the internet by using the
public IP address of any worker node in a Kubernetes cluster and exposing a node port (NodePort). We can use this option
for testing the IBM Cloud Kubernetes Service and for short-term public access (https://cloud.ibm.com/docs/containers?topic=containers-nodeport).
Finally, we create the kustomization.yaml file:
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- namespace.yaml
- deployment.yaml
- service_port.yaml
- service.yaml
We can configure our own image pull secret to deploy containers in Kubernetes namespaces other than the default name-
space. With this methodology, we can use images stored in other IBM Cloud accounts or images stored in external private
registries. In addition, we can create our own image pull secret to enforce IAM access rules that restrict rights to specific
registry image namespaces or actions (such as push or pull). We have several options to do that; one of them is to copy the
image pull secret from the Kubernetes default namespace to other namespaces in our cluster (https://cloud.ibm.com/docs/containers?topic=containers-registry#other).
Let us start by listing the namespaces in our cluster:
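kubectl get namespaces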
Then, let us list the image pull secrets in the Kubernetes default namespaces for the IBM Cloud Container Registry:
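For example, filtering for the IBM Cloud Container Registry secrets:
kubectl get secrets -n default | grep icr-io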
To deploy our application, we use the following single command in our master node:
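As in the previous section, assuming our kustomization files are in the base folder:
kubectl apply -k base/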
We copy the all-icr-io image pull secret from the default namespace to the namespace of our choice. The new image pull secrets are named <namespace_name>-icr-<region>-io:
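Following the approach in the IBM Cloud documentation referenced above, the secret can be copied with a command of this form (here using our mlapi namespace):
kubectl get secret all-icr-io -n default -o yaml | sed 's/default/mlapi/g' | kubectl create -n mlapi -f -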
To see all components deployed into this namespace, we can use the following command:
kubectl get ns
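Note that kubectl get ns lists the namespaces themselves; to list the resources deployed inside the mlapi namespace, we can use, for example:
kubectl get all -n mlapi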
To see the status of the deployment, we can use the following command:
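For example:
kubectl get deployment -n mlapi
kubectl get pods -n mlapi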
We can obtain the public IP address of a worker node in the cluster. If we want to access the worker node on a private
network or a VPC cluster, we obtain the private IP address instead:
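For example:
ibmcloud ks worker ls --cluster my-k8s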
We are now ready to use our deployed model by using curl or a web browser:
curl http://172.21.193.80:31261/score
We can navigate further through our Kubernetes Dashboard to check our services and many other features.
When we work on putting machine or deep learning models into production, there are questions that arise at some point:
Where can I deploy my code for training, and where can I deploy my code for batch or online inference? It can often happen
that we need to deploy our machine learning flow on a multi-architecture environment and hybrid cloud or multi-cloud
environment. We have seen how to deploy an application on IBM Cloud and how to deploy using on-premises or virtual
machines. Kubernetes can run on a variety of platforms: from simple clusters to complex ones, from laptops to multi-archi-
tectures, and on hybrid cloud or multi-cloud Kubernetes Clusters. The question is what solution best suits our needs.
6.9 Red Hat OpenShift to Develop and Deploy Enterprise ML/DL Applications
Investment is significantly increasing in ML/DL to create value, with different objectives such as masking complexity, pro-
ducing automation, reducing costs, growing businesses, better serving customers, making discoveries, performing research
and innovation, and other goals. Strong open-source communities for ML/DL on Kubernetes and OpenShift have been cre-
ated and are evolving. These communities are working to allow data scientists and developers to access and consume ML/
DL technologies. Working on a local computer and putting ML/DL models into production requires navigating across a vast
and complex space. Where do we deploy our code for training, and where do we deploy our code for batch or online infer-
ences? There are situations in which we will need to deploy machine learning workflows in a multi-architecture environ-
ment or in a hybrid cloud environment.
Today’s data centers are made of heterogeneous systems (x86, IBM Power Systems, IBM Z, high-performance com-
puting, accelerators such as GPUs or FPGAs, and others), running heterogeneous workloads with specialized ML/DL
frameworks, each with its strengths. In addition, we can see the cloud in all its dimensions (public and private cloud,
hybrid cloud, multi-cloud, distributed cloud). For instance, we can maintain a database with critical data running on
an IBM Power System that we want to leverage for our models, run our training code using GPUs, deploy batch or
online inference on IBM Z or LinuxONE in which critical transactional applications can avoid latency, and perform
another inference on a cloud or at the edge. There are an important number of options to consider depending on
one’s business. A typical ML/DL workflow starts with a business objective and involves a design to understand users,
challenge assumptions, redefine problems, co-create (for instance, putting IT and data scientist teams in the same
room) a solution to prototype, and test by iteration. We then collect private and public data, refine and store the data,
and create and validate models until we put everything into production for the real world. We need to consider scal-
ability of the application, resilience, versioning, security, availability, and other aspects. This requires additional
expertise, often related to specialized hardware resources, increasing the need for resource management and
utilization.
Data scientists cannot manage this entire, often complex, process; those I know personally want access to high-performance hardware and to stay focused on the data and the creation of models. This is also why many machine learning applications are never fully exploited: they are not prepared for production. Containers
and Kubernetes can avoid this kind of situation by accelerating ML/DL adoption and breaking all these barriers. There
is a clear movement to embrace Linux containers and Kubernetes to develop ML/DL applications and deploy them. Con-
tainers and Kubernetes are a way to simplify the access to underlying infrastructure by masking the complexity, allowing
management of the different workflows such as development or application lifecycles. Red Hat OpenShift will provide addi-
tional capabilities that are well suited for enterprise environments.
One great advantage of Red Hat OpenShift is the management of container images with Image Stream, allowing, for
example, changing a tag for an image in a container registry without downloading the image; we can tag it locally and
push it back. With OpenShift, once an image is uploaded, it can be managed within OpenShift by its virtual tag. It is also
possible to define triggers that, for example, start deployment when a tag changes its reference (e.g., from devel to stable or
prod tag) or if a new image appears. Another difference between Kubernetes and OpenShift is the web-based user inter-
face. The Kubernetes dashboard must be installed separately, and we can access it via kube-proxy. Red Hat OpenShift’s
web console has a login page that is easy to access and very helpful for daily administrative work, as resources can be
created and changed via a form. We can install Red Hat OpenShift clusters in the cloud using managed services (Red
Hat OpenShift on IBM Cloud, Red Hat OpenShift Service on AWS, Azure Red Hat OpenShift) or we can run them on
our own by installing from another cloud provider (AWS, Azure, Google Cloud, platform-agnostic). We also have the
possibility to create clusters on supported infrastructure (bare metal, IBM Z, Power, Red Hat OpenStack, Red Hat Vir-
tualization, vSphere, platform-agnostic) or a minimal cluster on our laptop (macOS, Linux, Windows), which is useful
for local development and testing.
6.9.3 Why Red Hat OpenShift for ML/DL? To Build a Production-Ready ML/DL Environment
I believe everybody would agree that creating high-performing ML/DL models and deploying ML/DL in production
require different sets of skills. To allow deployment of an ML/DL application in production, we need to put in place
an iterative process involving setting the business goals, gathering and preparing the data, developing models, deploy-
ing models, inferencing, monitoring, and managing accuracy over time. To execute this process, we need to imple-
ment an ML/DL architecture with ML/DL tools, DevOps tools, data pipelines, and access to resources (computing,
storage, network) whether in private, public, hybrid, or multi-cloud environments. Red Hat OpenShift is making a
difference because it allows data scientists and developers to focus on their models and code and deploy them on
Kubernetes without the need to learn Kubernetes in depth. We automate once, and then we simply develop the
IT environment. In other words, it is possible to manage the complexity of ML/DL model deployments and democ-
ratize access to the techniques, allowing the deployment of any containerized ML/DL stack at scale in any
environment.
Red Hat OpenShift has many features and benefits that can help data scientists and developers to focus on their business
and use the tools and languages with which they are most comfortable. OpenShift puts into action additional security con-
trols as well as the tools to manage multiple applications (multitenancy environment). OpenShift makes all IT environments
much easier to manage.
6.10 Deploying a Machine Learning Model as an API on the Red Hat OpenShift
Container Platform: From Source Code in a GitHub Repository with Flask, Scikit-Learn,
and Docker
ML/DL applications have become more popular than ever. As we have seen, Red Hat OpenShift Container Platform, an enterprise
Kubernetes platform, helps data scientists and developers to focus on value creation using their preferred tools by bringing
additional security controls into place and making environments much easier to manage. It provides the ability to deploy,
serve, secure, and optimize machine learning models at enterprise scale and in highly available clusters, allowing data scien-
tists to focus on the value of data. We can install Red Hat OpenShift clusters in the cloud using managed services (Red Hat
OpenShift on IBM Cloud, Red Hat OpenShift Service on AWS, Azure Red Hat OpenShift), or we can run them on our own
by installing from another cloud provider (AWS, Azure, Google Cloud, platform-agnostic). We can also create clusters on
supported infrastructure (bare metal, IBM Z, Power, Red Hat OpenStack, Red Hat Virtualization, vSphere, platform-agnostic) or a minimal cluster on a laptop (macOS, Linux, Windows), which is useful for local development and testing. There
is a lot of freedom here.
In this section, we will demonstrate how to deploy a simple machine learning model developed in Python on an OpenShift
cluster in the cloud. We will create an OpenShift cluster on IBM Cloud and show how to deploy a machine learning appli-
cation from a GitHub repository and expose the application to public access (with and without a Dockerfile).
We can do all of this with a few simple steps.
We can select options such as location, computing environment to run the cluster, or worker pool (number of vCPUs,
memory, encrypt local disk, etc.) and click Create. One interesting option is that we can choose Satellite, which allows
us to run our cluster in our own data center. For this example, we have chosen Classic:
Done! Our GitHub repository contains the following files:
• Dockerfile
• train.py
• api.py
• requirements.txt
OpenShift will automatically detect whether the Docker or source-build strategy is being used. In our repository, there is a
Dockerfile. OpenShift Enterprise will generate a Docker build strategy. The train.py file is a Python script that loads and
splits the Iris dataset, which is a classic and very simple multi-class classification dataset consisting of petal and sepal lengths
of three different types of iris (Setosa, Versicolour, and Virginica), stored in a 150 × 4 numpy.ndarray. We have used scikit-
learn for both dataset and model creation (support vector machine [SVM] classifier). The file requirements.txt (flask,
flask-restful, joblib) is for the Python dependencies, and api.py is the script that will be called to perform the inference
using a REST API. The API will return the classification score of the SVM model on the test data.
The train.py file is the following:
import os
from sklearn import svm
from joblib import dump
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

def train():
    # Load directory paths for persisting model
    MODEL_DIR = os.environ["MODEL_DIR"]
    MODEL_FILE = os.environ["MODEL_FILE"]
    MODEL_PATH = os.path.join(MODEL_DIR, MODEL_FILE)
    # Load and split the Iris dataset (split parameters are illustrative)
    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    # Train the SVM classifier and persist it with joblib (model parameters are illustrative)
    clf = svm.SVC(gamma='scale')
    clf.fit(X_train, y_train)
    dump(clf, MODEL_PATH)

if __name__ == '__main__':
    train()
The api.py file is the following:
import os
from flask import Flask
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from joblib import load

# Model path and Iris test data, set up as in train.py (split parameters are illustrative)
MODEL_PATH = os.path.join(os.environ["MODEL_DIR"], os.environ["MODEL_FILE"])
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

app = Flask(__name__)

# API
# Flask route so that we can serve HTTP traffic on that route
@app.route('/', methods=['POST', 'GET'])
def prediction():
    # Return the classification score of the SVM model on the Iris test data
    clf = load(MODEL_PATH)
    score = clf.score(X_test, y_test)
    return "Score: {}\n".format(score)

if __name__ == "__main__":
    app.run(debug=True, host='0.0.0.0', port=8080)  # Launch the built-in web server and run this Flask webapp
The Dockerfile, again based on the jupyter/scipy-notebook image and exposing port 8080, begins as follows:
FROM jupyter/scipy-notebook
#USER 1001
EXPOSE 8080
The requirements.txt file lists the Python dependencies:
flask
flask-restful
joblib
To verify that everything is working, let us build and run the Docker image on our local machine:
cd OpenShift-ML-Online
docker build -t my-ml-api:latest .
docker run my-ml-api
curl http://172.17.0.2:8080/
From the perspective switcher, we select Developer to switch to the Developer perspective. We can see that the menu
offers items such as +Add, Builds, and Topology:
In this section, we will use the CLI on our local machine. We have installed our OpenShift cluster on the IBM Cloud; if this
has not already been performed, it is necessary to install the IBM Cloud CLI and the OpenShift command-line interface (oc).
We then need to log in to OpenShift and create a new project. To do this, we need to copy the login command:
ibmcloud login
oc login --token=sha256~IWefYlUvt1St8K9QAXXXXXX0frXXX2-5LAXXXXNq-S9E
--server=https://c101-e.eu-de.containers.cloud.ibm.com:30785
The new-app command allows the creation of applications using source code in a local or remote Git repository. To create
an application using a Git repository, we can type the following command in the terminal:
oc new-app https://github.com/xaviervasques/OpenShift-ML-Online.git
This command has several options, such as using a subdirectory of our source code repository by specifying a --context-dir
flag, specifying a Git branch, or setting the --strategy flag to specify a build strategy. In our case, we have a Dockerfile that
will automatically generate a Docker build strategy.
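For instance, a build from a subdirectory of a repository, with the source-build strategy forced explicitly, could be requested as follows (the repository URL and directory are placeholders):
oc new-app https://github.com/<user>/<repo>.git --context-dir=app --strategy=source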
The application is deployed! The application now needs to be exposed to the outside world. As specified by the previous
output, we can do this by executing the following command:
oc expose service/openshift-ml-online
We can then check the status of the deployment with the following command:
oc status
We can also find the route of our API in the output and use it to get the expected output (the classification score of the SVM
model on the test data):
curl http://openshift-ml-online-default.mycluster-par01-b-948990-d8688dbc29e56a145f8196fa85f1481a-0000.par01.containers.appdomain.cloud
We can also find all information about our service in the OpenShift web console.
In the perspective switcher, we can select Developer to switch to the Developer perspective and click on Topology.
At the bottom-right of the screen, the panel displays the public URL at which the application can be accessed. It can be
seen under Routes. If we click on the link, we also obtain the expected result from our application.
We can create unsecured and secured routes using the web console or the CLI. Unsecured routes are the simplest to set up
and represent the default configuration. However, if we want to offer security for connections to remain private, we can use
the create route command and provide certificates and a key.
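For example, an edge-terminated route using our own certificate and key could be created as follows (the certificate and key file names are placeholders):
oc create route edge --service=openshift-ml-online --cert=example.crt --key=example.key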
If we want to delete the application from OpenShift, we can use the oc delete all command:
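For example, to delete everything carrying the application label added by oc new-app:
oc delete all --selector app=openshift-ml-online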
In a case in which there is no Dockerfile, the OpenShift Source to Image (S2I) toolkit will create a Docker image. The
source code language is auto-detected. OpenShift S2I uses a builder image and its sources to create a new Docker image
that is deployed to the cluster.
It becomes simple to containerize and deploy a machine or deep learning application using Docker, Python, Flask, and
OpenShift. When we begin to migrate workloads into OpenShift, the application is containerized into a container image that
will be deployed in the testing and production environments, decreasing the number of missing dependencies and miscon-
figuration issues that we see when we deploy applications in the real world.
Real life is what it is all about. Machine learning operations requires cross-collaboration among data scientists, develo-
pers, and IT operations, which can be time-consuming in terms of coordination. Building, testing, and training ML/DL
models on Kubernetes hybrid cloud platforms such as OpenShift allows consistent, at-scale application deployments
and helps to deploy, update, and redeploy as often as needed in the production environment. The integrated DevOps
CI/CD capabilities in Red Hat OpenShift allow us to automate the integration of our models with the process of
development.
Further Reading
https://www.ibm.com/cloud/learn/docker
https://mlinproduction.com
https://docs.docker.com/engine/
https://mlfromscratch.com/deployment-introduction/#/
https://mlinproduction.com/docker-for-ml-part-3/
https://docs.docker.com/engine/reference/builder/
https://scikit-learn.org/stable/
https://theaisummer.com/docker/
https://www.fernandomc.com/posts/your-first-flask-api/
https://mlinproduction.com/docker-for-ml-part-4/
https://www.kdnuggets.com/2019/10/easily-deploy-machine-learning-models-using-flask.html
https://medium.com/@fmirikar5119/ci-cd-with-jenkins-and-machine-learning-477e927c430d
https://www.jenkins.io
https://cloud.ibm.com/catalog/content/jenkins
https://towardsdatascience.com/automating-data-science-projects-with-jenkins-8e843771aa02
https://phoenixnap.com/blog/kubernetes-vs-docker-swarm
https://kubernetes.io/fr/docs/concepts/
https://kubernetes.io/fr/docs/tasks/tools/install-kubectl/
https://www.vagrantup.com/downloads
https://gitlab.com/xavki/presentations-kubernetes/-/tree/master
https://github.com/flannel-io/flannel
https://kubernetes.io/docs/concepts/workloads/controllers/job/
https://vsupalov.com/build-docker-image-clone-private-repo-ssh-key/
https://kubernetes.io/fr/docs/concepts/configuration/secret/
https://developer.ibm.com/technologies/containers/tutorials/scalable-python-app-with-kubernetes/
https://docs.docker.com/engine/install/ubuntu/
https://mlinproduction.com/k8s-jobs/
https://cloud.google.com/community/tutorials/kubernetes-ml-ops
https://github.com/IBM/deploy-ibm-cloud-private
https://kubernetes.io/fr/docs/setup/pick-right-solution/#solutions-clés-en-main
https://www.ibm.com/cloud/architecture/tutorials/microservices-app-on-kubernetes?task=1
https://cloud.ibm.com/docs/containers?topic=containers-registry#other
https://cloud.ibm.com/docs/containers?topic=containers-nodeport
https://cloud.ibm.com/docs/containers
https://www.openshift.com
https://en.wikipedia.org/wiki/OpenShift
https://www.openshift.com/learn/topics/ai-ml
https://cloudowski.com/articles/10-differences-between-openshift-and-kubernetes/
https://docs.openshift.com/containerplatform/4.5/cli_reference/developer_cli_odo/understanding-odo.html
https://developer.ibm.com/tutorials/deploy-python-app-to-openshift-cluster-source-to-image/?mhsrc=ibmsearch_a&mhq=deploy%20python%20app%20to%20openshift%20cluster%20source%20to%20image
https://docs.openshift.com/enterprise/3.1/dev_guide/new_app.html
https://github.com/jjasghar/cloud-native-python-example-app/blob/master/Dockerfile
https://docs.openshift.com/container-platform/3.10/dev_guide/routes.html
https://docs.okd.io/3.11/minishift/openshift/exposing-services.html
https://docs.openshift.com/container-platform/3.10/architecture/networking/routes.html#secured-routes
https://docs.openshift.com/container-platform/4.7/installing/index.html#ocp-installation-overview
https://www.openshift.com/blog/serving-machine-learning-models-on-openshift-part-1
https://developer.ibm.com/technologies/containers/tutorials/deploy-python-app-to-openshift-cluster-source-to-image/
https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html
Conclusion: The Future of Computing for Data Science?
Classical computing has experienced remarkable progress guided by Moore’s law. This law states that every two years, we double the number of transistors in a processor while, at the same time, roughly doubling performance or halving cost.
This pace has slowed down over the past decade, forcing a transition. We must rethink information technology (IT) and in
particular move toward heterogeneous system architectures with specific accelerators in order to meet the need for perfor-
mance. The progress that has been made in raw computing power has nevertheless brought us to a point at which biolog-
ically inspired computing models are now highly regarded as the state of the art.
Artificial intelligence (AI) is an area that brings opportunities for progress but also challenges. The capabilities of AI to interpret and analyze data have greatly increased. AI is also demanding in terms of computing power
because of the complexity of workflows. At the same time, AI can also be applied to the management and optimization of
entire IT systems.
In parallel with conventional or biologically inspired accelerators, programmable quantum computing is emerging,
thanks to several decades of investment in research to overcome traditional physical limitations. This new era of computing
will potentially have the capacity to make calculations that are not possible today with conventional computers. Future
systems will need to integrate quantum computing capabilities to perform specific calculations. Research is advancing rap-
idly. IBM made programmable quantum computers available through the cloud for the first time in May 2016, and has announced its ambition to double the quantum volume each year, an ambition known as “Gambetta’s law.”
The cloud is also an element that brings considerable challenges and opportunities in data science, and it has an important
role to play. The data centers of tomorrow will be piloted by the cloud and equipped with heterogeneous systems that will
run heterogeneous workloads in a secure manner. Data will no longer be centralized or decentralized but instead will be
organized in hubs. Storage systems are also full of challenges to improve, not only in terms of availability, performance, and
management but also regarding data fidelity. We have to design architectures to allow for the extraction of more and more
complex and often regulated data, which poses multiple challenges, in particular security, encryption, confidentiality, and
traceability.
The future of computing, as described by Dario Gil and William Green, will be built with heterogeneous systems made up
of classical computing, called binary or bit systems, biologically inspired computing, and quantum computing, called quan-
tum or qubit systems. These heterogeneous components will be orchestrated and deployed in a hybrid cloud architecture
that masks complexity while allowing the secure use and sharing of private and public systems and data.
Binary Systems
The first binary computers were built in the 1940s: Colossus (1943) and then the Electronic Numerical Integrator
and Computer (ENIAC, 1945). Colossus was designed to decrypt secret German messages, and the ENIAC
was designed to calculate ballistic trajectories. The ENIAC was the first fully electronic system built to be Turing-
complete: it can be reprogrammed to solve, in principle, all computational problems. The ENIAC was programmed
by women called the “ENIAC women.” The most famous of them were Kay McNulty, Betty Jennings, Betty Holberton,
Marlyn Wescoff, Frances Bilas, and Ruth Teitelbaum. These women had previously performed ballistic calculations
on mechanical desktop computers for the military. The ENIAC weighed 30 tons, occupied an area of 72 m2, and
consumed 140 kW.
Regardless of the task performed by a computer, the underlying process is always the same: an instance of the task is
described by an algorithm that is translated into a sequence of 0s and 1s to give rise to execution in the processor,
memory, and input/output devices of the computer. This is the basis of binary calculation, which in practice is based
on electrical circuits provided with transistors that can be in two modes: “ON,” allowing the current to pass, and “OFF,”
such that the current does not pass. From these 0s and 1s, over the past 80 years we have developed a classical infor-
mation theory constructed from Boolean operators (AND, XOR, etc.), words (bytes), and a simple arithmetic based on the following operations: 0 + 0 = 0, 0 + 1 = 1 + 0 = 1, 1 + 1 = 0 (with a carry), and checking whether 1 = 1, 0 = 0, and 1 ≠ 0. From these operations, it is possible to build much more complex operations, which the most powerful computers
can perform millions of billions of times per second. All this has become so “natural” that we have completely forgotten
that each transaction on a computer server, a PC, a calculator, or a smartphone breaks down into these basic binary
operations. In a computer, these 0s and 1s are contained in “BInary digiTs,” or “bits,” which represent the smallest
amount of information contained in a computer system. The electrical engineer and mathematician Claude Shannon
(1916–2001) was one of the founding fathers of information theory. For 20 years, Shannon worked at the Massachusetts
Institute of Technology (MIT); in addition to his academic activities, he worked at Bell Laboratories. In 1949, he mar-
ried Madame Moore. During the Second World War, Shannon worked for the American secret service in cryptography
to locate messages hidden in German codes. One of his essential contributions concerns the theory of signal transmis-
sion. It is in this context that he developed an information theory, in particular by understanding that any data, even
voice or images, can be transmitted using a sequence of 0s and 1s.
The binary systems used by conventional computers appeared in the middle of the twentieth century, when mathematics
and information were combined in a new way to form information theory, launching both the computer industry and tele-
communications. The strength of the binary system lies in its simplicity and reliability. A bit is either 0 or 1, a state that can
be easily measured, calculated, communicated, or stored. This powerful theory has allowed us to build the systems that are
running critical workloads around the world today. Thanks to this method, various calculations and data storage systems
have emerged, up to the storage of digital data on a DNA molecule.
Today, we have examples of binary systems with incredible possibilities. An example is the mainframe today called
IBM Z. At the time I am writing this book, an IBM Z processor single-chip module (SCM) is using silicon-on-insulator
(SOI) technology at 14 nm. It contains 9.1 billion transistors, and there are 12 cores per PU SCM at 5.2 GHz. This tech-
nology allows a single system to be able to process 19 billion encrypted transactions per day and 1000 billion web trans-
actions per day. The IBM Zs installed in the world today process 87% of bank card transactions and 8 trillion payments
per year.
We can also cite two of the most powerful computers in the world, Summit and Sierra. They are located in the Oak Ridge
Laboratory in Tennessee and the National Lawrence Laboratory in Livermore, California, respectively. These computers
help model supernovas or new materials, explore solutions against cancer, and study genetics and the environment. Summit
is capable of delivering a computer power of 200 petaflops with a storage capacity of 250 petabytes. It is composed of 9216
IBM Power9 CPUs, 27,648 NVIDIA Tesla GPUs, and a network communication of 25 gigabytes per second between nodes.
Despite all of these abilities, even the most powerful computer in the world, equipped with GPU accelerators, cannot cal-
culate everything.
Today, this type of technology (high-performance computing) is essential for medical research, as we saw during the
COVID-19 crisis. We can take the example of using the power of supercomputers with the HPC COVID-19 consortium
(https://covid19-hpc-consortium.org). This is a public–private effort initiated by several institutions, including IBM, aimed
at making the power of supercomputers available to researchers working on projects related to COVID-19 to help them
identify potential short-term therapies for patients affected by the virus.
Together, the Consortium has helped support many research projects, including understanding how long respiratory dro-
plets persist in the air. They found that droplets from breathing stay in the air much longer than previously thought, due to
their small size compared to droplets from coughs and sneezes. Another project concerns research into the reuse of drugs for
potential treatments. A project by a Michigan State University team has examined data from approximately 1600 FDA-
approved drugs to determine whether there are possible combinations that could help treat COVID-19. They have found
at least two drugs approved by the FDA that appear to be promising: proflavin, a disinfectant against many bacteria, and
chloroxin, another antibacterial drug.
As we may have thousands of candidate molecules for potential therapeutic treatment, the use of accelerated systems
and deep learning allows the best matches to be filtered in order to provide a selection of chemical compounds capable
of attaching to pathogen proteins. By doing this, the speed of the drug design process could be increased considerably.
AI will also help researchers to better profile the protein–protein interactions involved in the development of pathol-
ogies as well as to better understand the dynamics of infections in human cells. Thanks to this innovative approach, the
development cycle of therapeutic treatment could be accelerated, potentially going from several years to a few months
or even a few weeks.
Computers, smartphones and their applications, and the internet that we use in our everyday lives work with 0s and
1s. The binary system coupled with Moore’s law, a 50-year heritage, has made it possible to build systems that are robust
and reliable. For 50 years, we have seen incremental, linear evolution to produce performance gains. The next few years
will bring further innovations to produce performance gains, particularly in terms of materials, control processes, or
etching methods, including three-dimensional transistors, extreme ultraviolet lithography, or new materials such as haf-
nium or germanium. Binary systems therefore continue to evolve and will play a central role in the data center of
tomorrow.
Recently, IBM took a major step forward in chip technology by manufacturing the first 2-nm chip, packing 50 billion
transistors onto a fingernail-sized chip. This architecture can help processor manufacturers improve performance by
45% at the same power level as today’s 7-nm chips, or deliver the same level of performance using 75% less power.
Mobile devices with 2-nm-based processors could have up to four times the battery life of those with 7-nm chipsets. Laptops
would benefit from an increase in the speed of these processors, while autonomous vehicles may detect and respond to
objects faster. This technology will benefit data center energy efficiency, space exploration, AI, 5G and 6G communication,
and quantum computing.
Despite comments about the limitations of Moore’s law and the bottlenecks of current architectures, innovations con-
tinue, and binary systems will play a central role in the data center of tomorrow. Nevertheless, some challenges cannot
be addressed only with binary computing. For these challenges, we need to rethink computing with new approaches, draw-
ing inspiration from natural processes such as biology and physics.
DNA is also a source of inspiration. In 2017, a team published in the journal Science the capacity to store digital data on a
DNA molecule. They were able to store an operating system, a French film from 1895 (L’Arrivée d’un train à La Ciotat by
Louis Lumière), a scientific article, a photograph, a virus, and a $50 gift card in DNA strands and retrieve the data without
errors.
Indeed, a DNA molecule is intended to store information by nature. Genetic information is encoded in four nitrogenous
bases that make up a DNA molecule (A, C, T, and G). Today, it is possible to transcribe digital data into a new code. DNA
sequencing then makes it possible to read the stored information. Encoding is automated through software. A DNA
molecule is 3 billion nucleotides (nitrogenous bases). In 1 g of DNA, 215 petabytes of data can be stored; it would be pos-
sible to store all the data created by humans in one room using DNA. In addition, DNA can theoretically keep data in perfect
condition for an extremely long time. Under ideal conditions, it is estimated that DNA could still be deciphered after several
million years, thanks to “longevity genes.” DNA can withstand the most extreme weather conditions. The main weak points
today are high costs and processing times that can be extremely long.
The term AI appeared as such in 1956. Several American researchers, including John McCarthy, Marvin Minsky, Claude
Shannon, and Nathan Rochester of IBM, which had been very advanced in research that used computers for other than
scientific calculations, met at Dartmouth College in the United States. Three years after the Dartmouth seminar,
the two fathers of AI, McCarthy and Minsky, founded the AI lab at MIT. There was considerable investment, with too much
ambition, in imitating the human brain, and much of that hope was not realized at the time. The promises were broken.
A more pragmatic approach appeared in the 1970s and 1980s, which saw the emergence of machine learning and the reap-
pearance of neural networks in the late 1980s. This more pragmatic approach, an increase in computing power, and an
explosion of data have made it possible for AI to be present in all areas today; it is a transversal subject. The massive
use of AI poses some challenges, such as the need to label the data at our disposal. The problem with automation is that
it requires considerable manual work. AI needs education, and this is performed by tens of thousands of workers around the
world, which does not really appear to be a futuristic vision. Another challenge is the need for computing power. AI needs to
be trained, and AI has become more and more greedy for this process in terms of calculations. Training requires a doubling
of computing capacity every 3.5 months.
Figure: Training compute of landmark AI models, in petaflop/s-days on a logarithmic scale (roughly 1e-8 to 1e+4), from LeNet-5, TD-Gammon v2.1, and BiLSTM for speech, through deep belief nets with layer-wise pretraining and DQN, up to AlphaGoZero, showing a roughly 3.4-month doubling time.
Several approaches are currently being used or envisioned. For example, in the Summit supercomputer today, the cal-
culation of certain workloads is deported to accelerators such as GPUs. There are others, such as field-programmable gate
arrays (FPGAs) or “programmable logic networks,” which can realize desired digital functions. The advantage is that the
same chip can be used in many different electronic systems.
Progress in the field of neuroscience will allow the design of processors directly inspired by the brain. The way our brain
transmits information is not binary, and it is thanks to Santiago Ramón y Cajal (1852–1934), Spanish histologist and neu-
roscientist and winner of the Nobel Prize in physiology or medicine in 1906 with Camillo Golgi, that we now understand
better the architecture of the nervous system. Neurons are cellular entities separated by fine spaces (synapses); they are not
fibers of an unbroken network. The axon of a neuron transmits nerve impulses, action potentials, to target cells. The next
step in developing new types of AI-inspired and brain-inspired processors is to think differently about how we compute
today. Today, one of the major performance problems is the movement of data between the different components of the
von Neumann architecture: processor, memory, and storage. It is therefore imperative to add analog accelerators. What
dominates numerical calculations today, and deep learning calculations, in particular, is floating-point multiplication.
One of the methods envisioned as an effective means of gaining computational power is to go back in the computer history
by reducing precision, also called approximate calculation. For example, 16-bit precision engines are more than fourfold
smaller than 32-bit precision engines. This gain increases performance and energy efficiency. In simple terms, in approx-
imate calculation, we can make a compromise by exchanging numerical precision for the efficiency of calculation. Certain
conditions are nevertheless necessary, such as developing algorithmic improvements in parallel to guarantee iso-precision.
IBM has recently demonstrated the success of this approach with 8-bit floating-point numbers, using new techniques to
maintain the accuracy of gradient calculations and updating weights during backpropagation. Likewise, for inference
by a model resulting from deep learning algorithm training, the unique use of whole arithmetic on four or two precision
bits achieves accuracy comparable to a range of popular models and datasets. This progression will lead to a dramatic
increase in computing capacity for deep learning algorithms over the next decade.
Analog accelerators are another way of avoiding the bottleneck of von Neumann’s architecture. The analog approach uses
non-volatile programmable resistive processing units (RPUs) that can encode the weights of a neural network. Calculations
such as matrix or vector multiplication or the operations of matrix elements can be performed in parallel and in constant
time, without movement of the weights. However, unlike digital solutions, analog AI is more sensitive to the properties of
materials and intrinsically sensitive to noise and variability. These factors must be addressed by architectural solutions, new
circuits, and new algorithms. For example, analogous non-volatile memories (NVMs) can effectively speed up backpropa-
gation algorithms. By combining long-term storage in phase-change memory (PCM) devices, quasi-linear updating of
conventional complementary metal-oxide-semiconductor (CMOS) capacitors, and new techniques to eliminate device-
to-device variability, significant results have begun to emerge for the calculation of deep neural networks.
The research has also embarked on a quest to build a chip directly inspired by the brain. In an article published in Science,
IBM and its university partners describe a processor called SyNAPSE, which is made of a million neurons. The chip con-
sumes only 70 mW and is capable of 46 billion synaptic operations per second per watt, literally a synaptic supercomputer
that can be held in a hand. We have moved from neuroscience to supercomputers, new computing architectures, program-
ming languages, algorithms, and applications, and now a new chip called TrueNorth. TrueNorth is a neuromorphic CMOS
integrated circuit produced by IBM in 2014. It is a many-core processor network, with 4096 cores, each having 256 program-
mable simulated neurons for a total of just over 1 million neurons. In turn, each neuron has 256 programmable synapses,
allowing the transport of signals. Therefore, the total number of programmable synapses is slightly more than 268 million.
The number of basic transistors is 5.4 billion. Because memory, computation, and communication are managed in each of
the 4096 neurosynaptic cores, TrueNorth bypasses the bottleneck of the von Neumann architecture and is highly energy-
efficient. It has a power density 1/10,000 that of conventional microprocessors.
In an article published in Nature, IBM physicists and engineers have described the feat of writing and reading data in an
atom of holmium, a rare-earth element. This is a symbolic step forward but proves that this approach works and that we
might one day have atomic data storage. To illustrate what it means, imagine that we can store the entire iTunes library of
35 million songs on a device the size of a credit card. In the paper, the nanoscientists demonstrate the ability to read and
write one bit of data on one atom. For comparison, today’s hard disk drives use 100,000 to 1 million atoms to store a single bit
of information.
Of course, we cannot avoid discussing quantum computing. “Quantum bits” – or qubits – combine physics with infor-
mation and are the basic units of a quantum computer. Quantum computers use qubits in a computational model based on
the laws of quantum physics. Properly designed quantum algorithms will be capable of solving problems of great complexity
by exploiting quantum superposition and entanglement to access an exponential state space and then amplifying the prob-
ability of calculating the correct response by constructive interference. It was at the beginning of the 1980s, from the encour-
agement of the physicist and Nobel laureate Richard Feynman, that the idea of the design and development of quantum
computers was born: Whereas a classical computer works with bits of values 0 or 1, the quantum computer uses the fun-
damental properties of quantum physics and is based on qubits. With this technological progress, quantum computing
opens the way to the processing of computer tasks whose complexity is beyond the reach of current computers. But let
us start from the beginning.
At the beginning of the twentieth century, the theories of so-called classical physics were unable to explain certain pro-
blems observed by physicists. They therefore needed to be reformulated and enriched. Under the impetus of scientists, they
evolved initially toward a “new mechanics,” which became “wave mechanics” and finally “quantum mechanics.” Quantum
mechanics is the mathematical and physical theory that describes the fundamental structure of matter and the evolution
over time and space of the phenomena of the infinitely small. An essential notion of quantum mechanics is the duality of the
“wave–particle.” Until the 1890s, physicists had considered that the world is composed of two types of objects or particles: on
the one hand those that have a mass (such as electrons, protons, neutrons, atoms, and others) and, on the other hand, those
that do not (such as photons, waves, and others). To the physicists of the time, these particles were governed by the laws of
Newtonian mechanics for those with a mass and by the laws of electromagnetism for waves. We therefore had two theories
of physics to describe two different types of objects. Quantum mechanics invalidates this dichotomy and introduces the
fundamental idea of particle–wave duality. Particles of matter or waves must be treated with the same laws of physics.
The theory of wave mechanics became quantum mechanics a few years later. Big names are associated with the develop-
ment of quantum mechanics, including Niels Bohr, Paul Dirac, Albert Einstein, Werner Heisenberg, Max Planck, Erwin
Schrödinger, and many others. Planck and Einstein, being interested in the radiation emitted by a heated body and in the
photoelectric effect, were the first to understand that the exchanges of light energy could only be performed by “packets.”
Moreover, Einstein obtained the Nobel Prize in physics following the publication of his theory on the quantified aspect of
energy exchanges in 1921. Bohr extended the quantum postulates of Planck and Einstein from light to matter by proposing a
model reproducing the spectrum of the hydrogen atom. He obtained the Nobel Prize in physics in 1922 by defining a model
of the atom that could dictate the behavior of quanta of light. Passing from one energy level to a lower one, an electron
exchanges a quantum of energy. Step by step, rules were found to calculate the properties of atoms, molecules, and their
interactions with light.
From 1925 to 1927, a series of works by several physicists and mathematicians gave substance to two general theories applicable to these problems: Heisenberg’s matrix mechanics and Schrödinger’s wave mechanics.
The physicist Rolf Landauer, working at IBM, argued that “information is physical.” Computers are, of course, physical machines. It is therefore necessary to take into
account the energy costs generated by calculations and the reading and recording of bits of information as well as
energy dissipation in the form of heat. In a context in which the links between thermodynamics and information were
the subject of many questions, Landauer sought to determine the minimum amount of energy necessary to manipulate
a single bit of information in a given physical system. There should therefore be a limit, today called the Landauer limit
and discovered in 1961, that specifies that any computer system is obliged to dissipate a minimum amount of heat and
therefore consume a minimum amount of electricity. This research was fundamental because it showed that any computer system has a minimum thermal and electrical threshold below which it cannot operate: a chip cannot dissipate less energy than this bound. Because classical chips dissipate far more than the Landauer limit, the bound is not a practical constraint for classical systems, but scientists expect it to become especially important in the design of quantum chips. Recent work by
Charles Henry Bennett at IBM has consisted of a re-examination of the physical basis of information and the applica-
tion of quantum physics to the problems of information flows. His work has played a major role in the development of
an interconnection between physics and information.
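To give an order of magnitude, the Landauer bound is usually written as E = kB·T·ln 2 per bit erased. The short Python sketch below is an illustrative back-of-the-envelope calculation (not tied to any specific system) that evaluates this bound at room temperature:

import math

k_B = 1.380649e-23  # Boltzmann constant, in joules per kelvin
T = 300             # room temperature, in kelvin

# Landauer bound: minimum heat dissipated when a single bit is erased
E_min = k_B * T * math.log(2)
print(f"Landauer limit at {T} K: {E_min:.3e} J per bit")  # about 2.9e-21 J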
For a quantum computer, the qubit is the basic entity that represents, like the bit, the smallest entity allowing manip-
ulation of information. It has two fundamental properties of quantum mechanics: superposition and entanglement.
A quantum object (on a microscopic scale) can exist in an infinite number of states (as long as one does not measure this state). A qubit can therefore exist in any state between 0 and 1. Qubits can take both the value 0 and the value 1, or rather "a certain amount of 0 and a certain amount of 1," as a linear combination of the two basis states denoted |0⟩ and |1⟩, written α|0⟩ + β|1⟩ with coefficients α and β satisfying |α|² + |β|² = 1. Thus, whereas a classical bit describes only two states (0 or 1), the qubit can represent an
infinite number. This is one of the potential advantages of quantum computing from the point of view of information
theory. We can obtain an idea of the superposition of states using the analogy of the lottery ticket: A lottery ticket is
either winning or losing once we know the outcome of the game. However, before the draw, this ticket was neither a
winner nor a loser. It only had a certain probability of being a winner and a certain probability of being a loser; it was
therefore both a winner and a loser at the same time. In the quantum world, all the characteristics of particles can be
subject to this indeterminacy. For example, the position of a particle is uncertain. Before measurement, the particle is
neither at point A nor at point B; it has a certain probability of being at point A and a certain probability of being at
point B. However, after measurement, the state of the particle is well defined: It is either at point A or at point B.
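As a minimal numerical illustration (plain NumPy, not a quantum SDK), a qubit state α|0⟩ + β|1⟩ can be stored as a normalized two-component vector; the squared amplitudes |α|² and |β|² give the probabilities of the two measurement outcomes:

import numpy as np

# Equal superposition: alpha = beta = 1/sqrt(2)
alpha, beta = 1 / np.sqrt(2), 1 / np.sqrt(2)
state = np.array([alpha, beta], dtype=complex)

# Probabilities of measuring 0 or 1 are the squared amplitudes
probabilities = np.abs(state) ** 2
print(probabilities)         # [0.5 0.5]
print(probabilities.sum())   # 1.0, the state is normalized

# Each measurement "collapses" the state to 0 or 1
samples = np.random.choice([0, 1], size=1000, p=probabilities)
print(np.bincount(samples))  # roughly 500 zeros and 500 ones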
Another amazing property of quantum physics is entanglement. When we consider a system composed of several qubits,
they can sometimes “link their destiny,” that is to say not remain independent of each other even if they are separated in
space (while classical bits are completely independent of each other). This phenomenon is called quantum entanglement. If
we consider a system of two entangled qubits, then the measurement of the state of one of these two qubits gives us an
immediate indication of the result of an observation on the other qubit.
To illustrate this property, we can use another analogy: We can imagine two light bulbs, each in two different houses. By
entangling them, it becomes possible to know the state of one bulb (on or off ) by simply observing the second, because the
two would be immediately linked, or entangled, even if the houses are very far from each other.
The entanglement phenomenon makes it possible to describe correlations between qubits. If we increase the number of
qubits, the number of these correlations increases exponentially: for N qubits, there are 2^N correlations. This property gives
the quantum computer the possibility of carrying out manipulations on enormous quantities of values, quantities beyond
the reach of a conventional computer.
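This exponential growth is easy to see numerically: describing N qubits requires a vector of 2^N complex amplitudes. The following NumPy sketch (illustrative only) builds a uniform superposition one qubit at a time and prints how quickly the state vector grows:

import numpy as np

plus = np.array([1, 1]) / np.sqrt(2)   # single-qubit state (|0> + |1>)/sqrt(2)

state = plus
for n in range(1, 11):
    print(n, "qubit(s) ->", state.size, "amplitudes")  # size doubles: 2**n
    state = np.kron(state, plus)       # tensor product appends one more qubit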
The uncertainty principle discovered by Werner Heisenberg in 1927 tells us that whatever our measuring tools, we are
unable to precisely determine both the position and the speed of a quantum object (at the atomic scale). Either we know
exactly where the object is, and the speed will seem to fluctuate and become blurry, or we have a clear idea of the speed,
but its position will escape us.
In his book La quantique autrement: garanti sans équation!, Julien Bobroff describes the quantum experiment in
three acts:
In the first act, the quantum object behaves like a wave. The Schrödinger equation allows us to predict accurately how this wave propagates, how fast, in which direction, and whether it spreads out or contracts. Then decoherence makes its appearance. Decoherence happens extremely quickly, almost instantaneously. It occurs at the precise moment when the wave comes into contact with a measuring tool (e.g., a fluorescent screen) and is forced to interact with the particles that make up this device: this is the moment when the wave function collapses. The last act is the random choice among all the possible states. The draw is governed by the shape of the wave function at the time of measurement; in fact, only the shape of the wave function at the end of the first act dictates how likely the object is to appear here or there.
Another phenomenon of quantum mechanics is the tunnel effect. Bobroff uses the example of a tennis ball. If we throw a
tennis ball against a wall, it will come back to us! In the world of quantum, if the ball is a quantum wave function, it will only
partially bounce against a barrier. A small part can tunnel through to the other side. This implies that if the particle is meas-
ured, it will sometimes materialize on the left of the wall, sometimes on the right.
A quantum computer uses the laws of quantum mechanics to make calculations. It has to operate under certain conditions, sometimes extreme, such as immersing the system in liquid helium to reach temperatures close to absolute zero, or −273.15 °C.
Building a quantum computer relies on the ability to develop a computer chip on which qubits are engraved. From a
technological point of view, there are several ways of constituting qubits; they can be made of atoms, photons, electrons,
molecules, or superconductive metals. In most cases, a quantum computer needs extreme conditions to operate such as
temperatures close to absolute zero. The choice of IBM, for example, is to use superconducting qubits constructed with aluminum oxides (this technology is also called transmon qubits). As mentioned above, to allow quantum effects (superposition and entanglement), the qubits must be cooled to a temperature as close as possible to absolute zero (i.e., approximately −273 °C). At IBM, this operating threshold is approximately 20 mK! IBM demonstrated the ability to design a single qubit in 2007 and, in 2016, announced the availability in the cloud of the first operational physical system with five qubits together with the development environment "QISKit" (Quantum Information Science Kit), allowing the design, testing, and optimization of algorithms for commercial and scientific applications. The "IBM Q Experience" initiative is a first in the
industrial world. At the time of this writing, IBM now has several quantum computers, including a 127-qubit system, and
has recently published its roadmap.
The number of qubits will progressively increase, but this is not enough. In the race to develop quantum computers, other
components are essential beyond qubits. We can speak of “quantum volume” as a relevant measure of performance and
technological progress. Other measures have also been offered by companies and laboratories. We can define “quantum
advantage” as the point at which quantum computing applications offer a significant and demonstrable practical advantage
that exceeds the capabilities of conventional computers. The concept of “quantum volume” was introduced by IBM in 2017
and is beginning to spread to other manufacturers. Quantum volume is a measure that determines the power of a quantum
computer system, taking into account gate and measurement errors, crosstalk of the device, connectivity of the device, and
efficiency of the circuit compiler. The quantum volume can be used for any noisy intermediate-scale quantum (NISQ) com-
puter system based on gates and circuits. For example, if we lower the error rate by a factor of 10 without adding extra qubits, the quantum volume can increase by a factor of 500. On the contrary, if we add 50 additional qubits but do not decrease error rates, the quantum volume does not increase at all. Adding qubits is not everything.
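Roughly speaking, log2 of the quantum volume is the size n of the largest "square" circuit (n qubits, depth n) that the machine can run reliably (Cross et al., listed in Further Reading). The toy sketch below uses an invented table of achievable circuit depths to illustrate why adding qubits without lowering error rates does not raise the quantum volume:

# Hypothetical achievable depths d(m) of random "square" model circuits on m qubits;
# noise limits the depth, so d(m) shrinks even as more qubits become available.
achievable_depth = {2: 8, 4: 8, 8: 6, 16: 4, 32: 3, 64: 3}

# log2(QV) is the largest n such that an n-qubit, depth-n circuit still succeeds
log2_qv = max(min(m, d) for m, d in achievable_depth.items())
print("Quantum volume:", 2 ** log2_qv)  # 64 here: limited by depth, not by qubit count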
The challenges researchers face today are technological, such as stability over time. When we run a quantum algorithm on
a real quantum computer, there are many externalities that can disrupt the quantum state of the program, which is already
fragile. Another technological challenge concerns the quantity of qubits. Every time we increase the capacity of a quantum
computer by one qubit, we reduce its stability. Another challenge is that we will be forced to rethink the entirety of the
algorithms we know to adapt them to quantum computing.
And, of course, we need to be able to run tasks on these machines, which is why IBM has developed the QISKit program-
ming library. This open-source library for the Python language is available at qiskit.org. Its development is very active, and
all contributors, including IBM, regularly update the functionality of this programming environment.
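To give a flavor of what such a program looks like, the following minimal example builds and simulates a two-qubit circuit that combines superposition and entanglement; the exact API varies between Qiskit releases, and this sketch assumes a recent version with the qiskit-aer simulator package installed:

from qiskit import QuantumCircuit
from qiskit_aer import AerSimulator

# Superposition on qubit 0, then entangle it with qubit 1 (Bell state)
qc = QuantumCircuit(2, 2)
qc.h(0)                      # Hadamard: qubit 0 becomes (|0> + |1>)/sqrt(2)
qc.cx(0, 1)                  # CNOT: the two qubits now share their fate
qc.measure([0, 1], [0, 1])   # measure both qubits into classical bits

simulator = AerSimulator()
counts = simulator.run(qc, shots=1024).result().get_counts()
print(counts)  # roughly half '00' and half '11', never '01' or '10'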
Quantum computers will be added to conventional computers to address problems that are unsolved today. For example,
conventional computers can calculate complex problems that a quantum computer cannot, and there are problems that
both classical and quantum computers can solve. Finally, there are challenges that a conventional computer cannot solve
but that a quantum computer can address. Many applications are possible in the fields of chemistry, materials science,
machine learning, and optimization.
For example, it is difficult for a classical computer to calculate the energy of the caffeine molecule (with 24 atoms) exactly (that is to say, without any approximation); it is a very complex problem. We would need approximately 10^48 bits to represent the energy configuration of a single caffeine molecule at a time t. That is almost the number of atoms contained in the Earth, which is on the order of 10^50. However, we believe it is possible to perform this calculation with 160 qubits. Today, quantum computers are
used to address simple chemistry problems, with a small number of atoms, but the objective is of course to be able to address
much more complex molecules.
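The link between these two figures is simply that 160 qubits span a state space of 2^160 ≈ 10^48 basis states, the same order of magnitude as the number of classical bits quoted above. A one-line check of the arithmetic (not a chemistry simulation):

import math

print(2 ** 160)              # about 1.46e48 basis states for 160 qubits
print(math.log2(10 ** 48))   # about 159.5, i.e., roughly 160 qubits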
But that is not the only limitation. To provide a simple illustration, we can consider the so-called traveling salesman problem, or, as a modern example, the problem of routing delivery trucks. If we want to choose the most efficient route for a truck to deliver packages to 5 addresses, there are 12 possible routes, so it is possible to identify the best one. However, as we add more addresses, the problem becomes exponentially more difficult – by the time we have 15 deliveries, there are over 43 billion possible routes, making it virtually impossible to find the best. For example, with 71 cities, the number of candidate paths is greater than 5 × 10^80.
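The route counts quoted above follow from simple combinatorics: for n stops on a closed tour, there are (n − 1)!/2 distinct routes. The following check of the figures is purely illustrative:

import math

def route_count(n_stops):
    """Number of distinct closed routes through n_stops locations."""
    return math.factorial(n_stops - 1) // 2

print(route_count(5))    # 12
print(route_count(15))   # 43,589,145,600 -> over 43 billion
print(route_count(71))   # about 6e99, indeed greater than 5 x 10^80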
Currently, quantum computing is suitable for certain algorithms such as optimization, machine learning, or simulation.
With these types of algorithms, use cases apply in several industrial sectors. Financial services (portfolio risk optimization, fraud detection), health (drug research, protein studies, etc.), supply chains and logistics, chemicals, research into new materials, and oil exploration are all areas that will be primarily impacted. We can also address the future of medical
research with quantum computing, which should eventually allow the synthesis of new therapeutic molecules. In addition,
if we are to meet the challenge of climate change, we need to solve many problems such as designing better batteries, finding
less energy-intensive ways to grow our food, and planning our supply chains to minimize transport. Solving these problems
effectively requires radically improved computational approaches in areas such as chemistry and materials science as well as
in the fields of optimization and simulation – areas in which classical computing faces serious limitations.
For a conventional computer, considering the product of two numbers and obtaining the result is a very simple operation:
7 × 3 = 21 or 6739 × 892,721 = 6,016,046,819. This remains true even for very large numbers. But the opposite problem is
much more complex. Knowing a large number N, it is more complicated to find P and Q such that P × Q = N. This difficulty
forms the basis of current cryptographic techniques. Yet, it is estimated that a similar problem that would take 10^25 days on a conventional computer could be resolved in a few tens of seconds on a quantum machine. We speak in this case of exponential acceleration. With quantum computers, we can approach problems in an entirely new way by taking advantage of
entanglement, superposition, and interference: modeling the physical procedures of nature, performing many more sce-
nario simulations, and finding better optimization solutions or models in AI and ML processes. There are many cases
of optimization that are eligible for quantum computing, including those in the fields of logistics (shortest path), finance
(risk assessment, evaluation of asset portfolios), marketing, industry, and the design of complex systems. The field of AI is
also an active research field, and learning methods for artificial neural networks are beginning to emerge; thus, it is the
whole of human activities related to the processing of information that is potentially relevant in the future of quantum
computing. The domain of cybersecurity and cryptography is also a subject of attention. The Shor algorithm was demon-
strated over 20 years ago, and it could weaken the encryption commonly used on the internet; we will have to wait until
quantum machines are powerful enough to process this type of calculation. On the other hand, encryption solutions beyond
the reach of this algorithm have already been demonstrated. Quantum technology itself will also provide solutions to protect
data. Therefore, the field of quantum technologies, and quantum computing in particular, is considered strategic.
We can find many use cases in banks and financial institutions to improve trading strategies and management of client
portfolios and to better analyze financial risks. A quantum algorithm in development, for example, could potentially provide
quadratic acceleration when pricing derivatives – complex financial instruments that require 10,000 simulations to be valued on a conventional computer but, thanks to the quadratic speedup (√10,000 = 100), would only require on the order of 100 quantum operations on a quantum device.
Another use case for quantum computing is the optimization of trading. It will be possible for banks to accelerate portfolio optimizations and Monte Carlo simulations. The simulation of buying and selling of products (trading) such as deriva-
tives can be improved using quantum computing. The complexity of trading activities in financial markets is skyrocketing.
Investment managers struggle to integrate real constraints, such as market volatility and changes in client life events, into
portfolio optimization. Currently, rebalancing of investment portfolios to follow market movements is strongly impacted by
calculation constraints and transaction costs. Quantum technology could help reduce the complexity of today’s business
environments. The combinatorial optimization capabilities of quantum computing can enable investment managers to
improve portfolio diversification, to rebalance portfolio investments to respond to market conditions and investor objectives
more precisely, and to streamline more cost-effective transaction settlement processes. Machine learning is also used for
portfolio optimization and scenario simulation. Banks and financial institutions such as hedge funds are increasingly inter-
ested because they see it as a way to minimize risks while maximizing gains with dynamic products that can adapt according
to new, simulated data. Personalized finance is also an area that is being explored. Customers demand personalized products
and services that can quickly anticipate changing needs and behaviors. Small- and medium-sized financial institutions can
lose customers because of offers that do not favor the customer experience. It is difficult to create analytical models using
behavioral data quickly and precisely enough to target and predict the products that customers need in nearly real time.
A similar problem exists in detecting fraud to find patterns of unusual behavior. Financial institutions are estimated to
lose between $10 billion and $40 billion in revenue annually due to fraud and poor data-management practices. For cus-
tomer targeting and forecast modeling, quantum computing could be a game-changer. The data-modeling capabilities of
quantum computers are expected to be superior in finding models, performing classifications, and making predictions that
are not possible today with conventional computers due to the challenges of complex data structures.
Another use case in the world of finance is risk analysis. Risk analysis calculations are demanding because it is difficult to
analyze many scenarios. Compliance costs are expected to more than double in the coming years. Financial services institu-
tions are under increasing pressure to balance risk, hedge positions more effectively, and perform a wider range of stress
tests to comply with regulatory requirements. Monte Carlo simulations, the preferred technique for analyzing the impact of
risk and uncertainty in financial models, are currently limited by the scaling of the estimation error. Quantum computers
have the potential to sample data differently by testing more results with greater accuracy, providing quadratic acceleration
for these types of simulations.
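The limitation mentioned here is the classical Monte Carlo convergence rate: with M samples, the estimation error shrinks only as 1/√M, whereas quantum amplitude estimation promises an error scaling of roughly 1/M, hence the quadratic speedup (see, e.g., Woerner and Egger in Further Reading). A small illustrative sketch of the classical behavior, estimating a simple expectation:

import numpy as np

rng = np.random.default_rng(0)
true_value = 0.5  # expectation of a uniform random variable on [0, 1]

for m in (100, 10_000, 1_000_000):
    estimate = rng.random(m).mean()
    error = abs(estimate - true_value)
    print(f"M = {m:>9,}  error ~ {error:.5f}  (1/sqrt(M) = {1 / np.sqrt(m):.5f})")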
Molecular modeling may allow for discoveries such as more efficient lithium batteries. Quantum computing will enable atomic interactions to be modeled much more precisely and at much larger scales; we can again use the example of the caffeine molecule. New materials will find uses everywhere, whether in consumer products, cars, batteries, or other places. Quantum computing will allow molecular orbital calculations to be performed without approximation. Other
applications are the optimization of a country’s electricity network, more predictive environmental modeling, and the
search for energy sources with lower emissions.
Aeronautics will also be a source of use cases. For each landing of an airplane, hundreds of operations are performed: crew
change, refueling, cabin cleaning, baggage delivery, and inspections; each operation has its own suboperations. Refueling
requires an available tanker, a truck driver, and two people to add fuel; it must also be ensured in advance that the tanker
is full. With hundreds of aircraft landing and flights that are sometimes delayed, the problem becomes more and more
complex. It is then necessary to recalculate all of these factors for all planes in real time.
Electric vehicles have a weakness, namely the capacity and speed of charging their batteries. A breakthrough in quantum
computing made by researchers from IBM and the automobile manufacturer Daimler could help meet this challenge. Daim-
ler is very interested in the impact of quantum computing to optimize transport logistics and to predict future materials for
electric mobility, in particular the next generation of batteries. There is every reason to hope that quantum computers will
yield results in the years to come to accurately simulate the chemistry of battery cells, aging processes, and the performance
limits of battery cells.
The traveling salesman problem can be extended to many fields such as energy, telecommunications, logistics,
production chains, and resource allocation. For example, in sea freight, there is great complexity in the management of
containers from start to finish. Loading, conveying, delivering, and then unloading in several ports in the world is a
multi-parameter problem that can be addressed by quantum computing.
A better understanding of the interactions between atoms and molecules will make it possible to discover new drugs.
Detailed analysis of DNA sequences will help detect cancer earlier by developing models that can determine how diseases
develop. The advantage of quantum computing will be the ability to analyze the behavior of molecules in detail and on a
scale never reached before. Chemical simulations will allow the discovery of new drugs and better prediction of protein
structures, scenario simulations will better predict the risks of a disease or its spread, the resolution of optimization pro-
blems will optimize the chains of distribution of drugs, and finally the use of AI will speed up diagnoses and analyze genetic
data more precisely.
The data center of tomorrow will be made of heterogeneous systems, which will run heterogeneous workloads. The sys-
tems will be located as close as possible to the data. These heterogeneous systems will be equipped with binary, biologically
inspired, and quantum accelerators. These architectures will be the foundations for addressing these challenges. Like an orchestra conductor, the hybrid cloud will make these systems play in concert thanks to a layer of security and intelligent automation.
Final Thoughts
As you can see, there are many things coming from the hardware side that will certainly allow machine learning to progress
dramatically. Today, we are at the beginning of broad AI and foundation models. We define foundation models as models
that are trained on large datasets (usually using large-scale self-supervision) and that can be adapted to a wide range of
downstream tasks. The rise of these models (e.g., BERT, DALL-E, GPT-3) represents a paradigm shift. Models are injected
with various data (text, audio, video, images, structured data, etc.); the models train on this data and can then perform
functions such as answering questions (“Is AI dangerous for humans?”), writing texts (e.g., a philosophical text on a given
subject), generating code, performing object recognition, translating, and other tasks. Training these models requires a large
amount of computing power, with a cost of several million for the creation and training of a single foundation model. That is why
progress in hardware will also contribute significantly to the progress of AI.
I hope you have enjoyed reading and learning from this book as much as I have enjoyed writing it.
Further Reading
Ambrogio, S., Narayanan, P., Tsai, H. et al. (2018). Equivalent-accuracy accelerated neural-network training using analog memory.
Nature 558: 60–67.
Athmanathan, A., Stanisavljevic, M., Papandreou, N. et al. (2016). Multilevel-cell phase-change memory: a viable technology. IEEE
Journal of Emerging and Selected Topics in Circuits and Systems 6 (1): 87–100.
Burr, G.W., Brightsky, M.J., Sebastian, A. et al. (2016). Recent progress in phase-change memory technology. IEEE Journal on
Emerging and Selected Topics in Circuits and Systems 6 (2): 146–162.
Burr, G.W., Shelby, R.M., Sebastian, A. et al. (2016). Neuromorphic computing using non-volatile memory. Advances in Physics:
X 2: 89–124.
Boybat, I., Le Gallo, M., Nandakumar, S.R. et al. (2018). Neuromorphic computing with multi-memristive synapses.
Nature Communications 9 (1): 2514. https://doi.org/10.1038/s41467-018-04933-y.
Ceze, L., Nivala, J., and Strauss, K. (2019). Molecular digital data storage using DNA. Nature Reviews Genetics 20 (8): 456–466.
https://doi.org/10.1038/s41576-019-0125-3.
Choi, J., Wang, Z., Venkataramani, S., et al. (2018). PACT: Parameterized clipping activation for quantized neural networks. arXiv:
1805.06085v2 [cs.CV].
Cross, A.W., Bishop, L.S., Sheldon, S., et al. (2019). Validating quantum computers using randomized model circuits. arXiv:
1811.12926v2 [quant-ph].
DeBole, M.V., Taba, B., Amir, A. et al. (2019). TrueNorth: accelerating from zero to 64 million neurons in 10 years. Computer 52 (5): 20–29.
Egger, D.J., Gutiérrez, R.G., Mestre, J.C. et al. (2019). Credit risk analysis using quantum computers. arXiv: 1907.03044 [quant-ph].
Feynman, R. (1982). Simulating physics with computers. International Journal of Theoretical Physics 21 (6/7).
Gao, Q., Nakamura, H., Gujarati, T.P., et al. (2019). Computational investigations of the lithium superoxide dimer rearrangement
on noisy quantum devices. arXiv: 1906.10675 [quant-ph].
Gil, D. and Green, W.M.J. (2019). The future of computing: bits + neurons + qubits. arXiv: 1911.08446 [physics.pop-ph].
Gupta, S., Agrawal, A., Gopalakrishnan, K. et al. (2015). Deep learning with limited numerical precision. In: Proceedings of the 32nd International Conference on International Conference on Machine Learning – Volume 37 (ICML'15), pp. 1737–1746. JMLR.org.
Harwood, S.M., Trenev, D., Stober, S.T. et al. (2022). Improving the variational quantum eigensolver using variational adiabatic
quantum computing. ACM Transactions on Quantum Computing 3, 1, Article 1 (March 2022). https://doi.org/10.1145/3479197.
Havlíček, V., Córcoles, A.D., Temme, K. et al. (2019). Supervised learning with quantum-enhanced feature spaces. Nature 567:
209–212.
Haensch, W., Gokmen, T., and Puri, R. (2018). The next generation of deep learning hardware: analog computing. Proceedings of the
IEEE 107: 108–122.
Kandala, A., Mezzacapo, A., Temme, K. et al. (2017). Hardware-efficient variational quantum eigensolver for small molecules
and quantum magnets. Nature 549: 242–246.
LeCun, Y., Bottou, L., Bengio, Y. et al. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE 86:
2278–2324.
Mackin, C., Tsai, H., Ambrogio, S. et al. (2019). Weight programming in DNN analog hardware accelerators in the presence of
NVM variability. Advanced Electronic Materials 5: 1900026.
Merolla, P.A., Arthur, J.V., Alvarez-Icaza, R. et al. (2014). A million spiking-neuron integrated circuit with scalable communication
network and interface. Science 345 (6197): 668–673.
Rice, J.E., Gujarati, T.P., Takeshita, T.T. et al. (2020). Quantum chemistry simulations of dominant products in lithium-sulfur
batteries. arXiv: 2001.01120 [physics.chem-ph].
Shannon, C.E. (1940). A Symbolic Analysis of Relay and Switching Circuits. Thesis. MIT, Department of Electrical Engineering.
Shannon, C.E. (1948). A mathematical theory of communication. Bell System Technical Journal 27: 379–423 and 623–656.
Shannon, C.E. and Weaver, W. (1949). The Mathematical Theory of Communication. The University of Illinois Press.
Stamatopoulos, N., Egger, D.J., Sun, Y. et al. (2020). Option pricing using quantum computers. Quantum 4: 291. https://doi.org/
10.22331/q-2020-07-06-291.
Suzuki, Y., Uno, S., Raymond, R. et al. (2020). Amplitude estimation without phase estimation. Quantum Information Processing
19: 75.
Suzuki, Y., Yano, H., Gao, Q., et al. (2019). Analysis and synthesis of feature map for kernel-based quantum classifier. arXiv:
1906.10467 [quant-ph].
Tang, J., Bishop, D., Kim, S. et al. (2018). ECRAM as Scalable Synaptic Cell for High-Speed, Low-Power Neuromorphic Computing.
IEEE-IEDM.
Woerner, S. and Egger, D. J. (2019). Quantum risk analysis. npj Quantum Information 5: 15. https://doi.org/10.1038/s41534-019-
0130-6.
Yuste, R. (2015). The discovery of dendritic spines by Cajal. Frontiers in Neuroanatomy 9: 18. https://doi.org/10.3389/fnana.2015.00018.
Zoufal, C., Lucchi, A., and Woerner, S. (2019). Quantum generative adversarial networks for learning and loading random distributions. npj Quantum Information 5: 103. https://doi.org/10.1038/s41534-019-0223-2.
https://www.research.ibm.com/frontiers/ibm-q.html
https://news.exxonmobil.com/press-release/exxonmobil-and-ibm-advance-energy-sector-application-quantum-computing
Index
Note: Page numbers in italics refer to figures; those in bold refer to tables.
j
James-Stein encoding method 74–75
Jenkins 389, 400
  installation 397–399
  integrate machine learning models using 396–405
  scenario implementation 399–405
JupyterLab 13
Jupyter Notebooks 13, 33, 77, 98, 182, 353, 379

k
KDE (kernel density estimation) 257
Keras 12, 167
  binary logistic regression with 210–211
  logistic regression with 208–210
kernel density estimation (KDE) 257
kernel functions 212, 217, 303, 306
kernel trick. see kernel function
k-fold cross-validation 4
k-means clustering algorithm 252–255
k-nearest neighbors (KNN) 274
  imputation method 93–97
  permutation feature importance with 165–166
KNN. see k-nearest neighbors
Kubeadm installation 434–435
kubectl 453
kubelet 405
kube-proxy 405
Kubernetes
  application to 448–452

l
label encoding method 62, 130
Lagrange multipliers 213
lag variables 79–82
language modeling 287
Large Models, Large Language Models (LMs/LLMs) 286
LDA (linear discriminant analysis) 7, 110–115, 381, 430, 431
learning algorithm 177
learning styles, for machine learning 2–9
  methods 9
  reinforcement learning 9
  semi-supervised learning 9
  supervised learning. see supervised learning
  unsupervised learning 9
least absolute shrinkage and selection operator (lasso) regression 154–156, 169
leave-one-out encoding method 73–74
linear algorithms 131
linear discriminant analysis (LDA) 7, 110–115, 381, 430, 431
linear interpolation 91–92
linear regression 68, 137, 176–202
  gradient descent to cost function 177–181
  implementation 182–202
  math 176–177
  multiple 185–202
  univariate 182–185
linear support vector classification algorithm 157
LLE (locally linear embedding) method 115–123
L-Measure 219
LMs/LLMs (Large Models, Large Language Models) 286
l2 norm 43
locally linear embedding (LLE) method 115–123
local tangent space alignment (LTSA) 121
loc method 255
logistic data transformation 43
logistic function 202
logistic regression 202–211
  binary 202–204, 210–211
  with Keras on TensorFlow 208–210
  multinomial. see multinomial logistic regression
  with sklearn 205–208
log loss 8
lognormal cumulative distribution functions 43, 44
lognormal transformation 43
log transformation 44
long short-term memory (LSTM) 233, 242–246
loss functions 7–9, 203, 225–226
  binary cross-entropy as 352
L2 regularization (ridge regression) 156–157
LSTM (long short-term memory) 233, 242–246
LTSA (local tangent space alignment) 121

m
machine learning (ML) 1. see also machine learning algorithms
  application 443–446
  data preprocessing for 36
  Docker. see Docker, for machine learning
  feature engineering. see feature engineering techniques
  goal 4
  handling missing values in 88–97
  hephAIstos for running. see hephAIstos, for machine learning
  learning styles for 2–9
  models 392–393, 431–432, 454–463
  production. see production, machine learning in
  Python tools for 9–13
  quantum advantage for 303
  quantum computing and. see quantum computing
  Red Hat OpenShift to 452–454
  workflow 2
machine learning algorithms 94
  artificial neural networks 223–249
  with hephAIstos 264–269
  linear regression 176–202
  logistic regression 202–211
  many more algorithms to explore 249–251
  in quantum computing. see quantum machine learning
  rule-based 176
  support vector machine 211–222
  unsupervised 251–264
machine learning operations (MLOps) 396–405
MAE (mean of the absolute errors) 8
magnetoencephalograms (MEGs) 102
Mahalanobis distance transformation 318, 327
make_blobs() function 253
manifold learners 116
MAR (missing at random) 92
Masked Language Modeling (MLM) 287, 288
math 176–177
matplotlib 10, 101
MaxAbsScaler 40, 57
maximum-margin hyperplane 212, 306
max-norm 43
MCAR (missing completely at random) 92
mean 37, 37
  imputation 90
mean encoding methods 66–67, 67
mean of the absolute errors (MAE) 8
mean shift clustering 252, 257–259
mean squared error (MSE) 8
median imputation 90
MEGs (magnetoencephalograms) 102
M-estimator encoding methods 76
method(s)
  backward difference encoding 72–73
  bag-of-words 274, 278, 280, 296
  binary encoding 64, 64–65
  data rescaling 18–19
  diff() 85
  elastic net 157
  embedded. see embedded methods
  encoding. see encoding methods
  feature extraction. see feature extraction method
  feature selection. see feature selection method
  filter 132–146
  head() 255
  KNN imputation 93–97
  loc 255
  quantum kernel 307
  regularization 154
  shift() 79
  weight of evidence 68–70
  wrapper. see wrapper methods
MICE (multivariate imputation by chained equation) imputations 92–93, 93
microservice approach 375–376
mini-batch k-means clustering algorithm 252, 255–256
Minkowski norm 43
MinMaxScaler 39, 57, 120
misclassification rate 6
missing at random (MAR) 92
missing completely at random (MCAR) 92