Module 1

Uploaded by

aryan
Copyright
© © All Rights Reserved
Module 1: Introduction

to Machine Learning
Course Name: Machine Learning & Blockchain
Course Code: IoTCSBCC701
Faculty Incharge: Dr. Sheeba P. S.
Module 1
Introduction to Machine Learning
Introduction:- What Is Learning? When Do We Need Machine Learning?
Types of Learning, Relations to Other Fields
Basic Terminology & Framework:- Machine Learning Terminology
Roadmap for building machine learning -- Preprocessing, Training, and Model selection,
Evaluating and Predicting
Python for machine learning -- Packages for scientific computing, data science, and machine
learning
Data Preprocessing:-
Dealing with missing data, Handling Categorical data, Partitioning a dataset into separate
training and test datasets, Bringing features onto the same scale, Select meaningful features
What is Machine Learning?
In the real world, humans learn from their experiences, whereas
computers and machines simply follow our instructions.
But can a machine also learn from experience or past data,
as a human does? This is where machine learning comes in.
Machine learning is a subset of artificial intelligence that is
mainly concerned with developing algorithms that allow a
computer to learn from data and past experience on its own.
The term "machine learning" was first introduced by Arthur
Samuel in 1959.
What is Machine Learning?
“Learning is any process by which a system improves performance from
experience.” - Herbert Simon
Definition by Tom Mitchell (1998):
Machine Learning is the study of algorithms that
• improve their performance P
• at some task T
• with experience E.
A well-defined learning task is given by <P,T,E>.
Magic?
No, it's more like gardening

• Seeds = Algorithms
• Nutrients = Data
• Gardener = You
• Plants = Programs

In Machine Learning, algorithms are like the seeds that the gardener plants. The algorithms are
trained on data and monitored to see how they perform. Just like a gardener adjusts the conditions
for each plant, the machine learning algorithm can be adjusted to improve its performance. This can
involve tweaking parameters, adding or removing features, or trying different algorithms altogether.
How does Machine Learning work

A machine learning system learns from historical data, builds prediction models, and
predicts the output whenever it receives new data.
The accuracy of the predicted output depends in part on the amount of data: more
representative data generally helps to build a model that predicts more accurately.
Suppose we have a complex problem that requires predictions. Instead of writing code
for it by hand, we feed the data to generic algorithms, and the machine builds the
logic from the data and predicts the output.
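The learn-then-predict cycle described above can be sketched in a few lines of Python. This is only an illustration: the nearest-centroid rule, the function names fit and predict, and the toy customer data are assumptions made for this example, not part of the course material.

```python
# Minimal sketch: learn from historical (labelled) data, then predict for new data.
# The nearest-centroid rule and all names below are illustrative assumptions.

def fit(samples, labels):
    """Build the 'model': one mean point (centroid) per label."""
    sums, counts = {}, {}
    for x, y in zip(samples, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, value in enumerate(x):
            acc[i] += value
        counts[y] = counts.get(y, 0) + 1
    return {y: [s / counts[y] for s in acc] for y, acc in sums.items()}

def predict(model, x):
    """Return the label whose centroid is closest to x."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(model, key=lambda y: sq_dist(model[y]))

# Hypothetical historical data: (hours of use per day, complaints) -> outcome
history = [(1.0, 5.0), (0.5, 4.0), (6.0, 0.0), (5.0, 1.0)]
outcome = ["leaves", "leaves", "stays", "stays"]

model = fit(history, outcome)          # learning step
print(predict(model, (5.5, 0.5)))      # prediction for new, unseen input
```

No prediction rule was written by hand here: changing the historical data changes the model, and with it the predictions.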
How does Machine Learning work…
The block diagram explains the working of Machine Learning algorithm:
Features of Machine Learning:

• Machine learning uses data to detect patterns in a given dataset.
• It can learn from past data and improve automatically.
• It is a data-driven technology.
• Machine learning is closely related to data mining, as both deal with large
amounts of data.
Need for Machine Learning
➢ Machine learning is needed because it can perform tasks that are too
complex for a person to program directly.
➢ As humans, we are limited in how much data we can process manually,
so we need computer systems, and machine learning makes this easier for us.
➢ We can train machine learning algorithms by providing them with large amounts
of data and letting them explore the data, construct models, and predict the
required output automatically.
➢ The performance of a machine learning algorithm depends on the amount of
data, and it can be measured by a cost function.
➢ With the help of machine learning, we can save both time and money.
Need for Machine Learning…
The importance of machine learning can be easily understood by its use cases:
Currently, machine learning is used in self-driving cars, cyber fraud detection,
face recognition, friend suggestions on Facebook, and more.
Top companies such as Netflix and Amazon have built machine
learning models that use vast amounts of data to analyze user
interests and recommend products accordingly.
What is Learning?
Learning is the process of converting experience into expertise or knowledge.

The input to a learning algorithm is training data, representing experience, and
the output is some expertise, which usually takes the form of another computer
program that can perform some task.
What Is Learning?...
Example from animal learning:
Bait Shyness: Rats Learning to Avoid Poisonous Baits
When rats encounter food items with novel look or smell, they will first eat very
small amounts, and subsequent feeding will depend on the flavor of the food and
its physiological effect.
A typical machine learning task: Program a machine that learns how to filter spam
e-mails.
A naive solution: the machine simply memorizes all previous e-mails that were
labelled as spam by the human user.
When a new e-mail arrives, the machine searches for it in the set of previous
spam e-mails. If it matches one of them, it is trashed. Otherwise, it is
moved to the user's inbox folder.
What Is Learning?...
Although the “learning by memorization” approach is sometimes useful, it lacks an
important aspect of learning systems: the ability to label unseen e-mail messages.
A successful learner should be able to progress from individual examples to broader
generalization. This is also referred to as inductive reasoning or inductive inference.
After the rats encounter an example of a certain type of food, they apply their
attitude toward it on new, unseen examples of food of similar smell and taste.
What Is Learning?...
To achieve generalization in the spam filtering task, the learner can scan the
previously seen e-mails, and extract a set of words whose appearance in an e-mail
message is indicative of spam.
Then, when a new e-mail arrives, the machine can check whether one of the
suspicious words appears in it, and predict its label accordingly.
Such a system would potentially be able to correctly predict the label of unseen
e-mails.
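The difference between memorizing and generalizing can be sketched as follows. The toy e-mails, the whitespace tokenization, and the rule "a word is suspicious if it appears in spam but never in non-spam" are assumptions chosen for illustration only.

```python
# Sketch contrasting memorization with a simple word-based generalization.
# The e-mails and the "suspicious word" rule are illustrative assumptions.

spam = ["win money now", "cheap money offer"]
ham = ["meeting at noon", "project offer review now"]

def memorizing_filter(email):
    """Flag an e-mail only if it exactly matches a previously seen spam e-mail."""
    return email in spam

def learn_suspicious_words(spam_emails, ham_emails):
    """Collect words seen in spam e-mails that never occur in non-spam e-mails."""
    spam_words = {w for e in spam_emails for w in e.split()}
    ham_words = {w for e in ham_emails for w in e.split()}
    return spam_words - ham_words

suspicious = learn_suspicious_words(spam, ham)

def word_filter(email):
    """Flag an e-mail if it contains at least one suspicious word."""
    return any(w in suspicious for w in email.split())

new_email = "win a cheap holiday"
print(memorizing_filter(new_email))  # False: this exact e-mail was never seen
print(word_filter(new_email))        # True: contains "win" and "cheap"
```

The word-based filter can label e-mails it has never seen, which is exactly what the memorizing filter cannot do.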
However, inductive reasoning might lead us to false conclusions.
What Is Learning?...
Example from animal learning
Pigeon Superstition: In an experiment performed by the psychologist B. F. Skinner, he placed a
bunch of hungry pigeons in a cage.
An automatic mechanism had been attached to the cage, delivering food to the pigeons at
regular intervals with no reference whatsoever to the birds' behavior. The hungry pigeons
went around the cage, and when food was first delivered, it found each pigeon engaged in
some activity (pecking, turning the head, etc.).
The arrival of food reinforced each bird's specific action, and consequently, each bird tended
to spend some more time doing that very same action. That, in turn, increased the chance
that the next random food delivery would find each bird engaged in that activity again.
What results is a chain of events that reinforces the pigeons' association of the delivery of the
food with whatever chance actions they had been performing when it was first delivered.
They subsequently continue to perform these same actions diligently.
What Is Learning?...
While human learners can rely on common sense to filter out random meaningless
learning conclusions, once we export the task of learning to a machine, we must
provide well defined crisp principles that will protect the program from reaching
senseless or useless conclusions.
One distinguishing feature between the bait shyness learning and the pigeon
superstition is the incorporation of prior knowledge that biases the learning
mechanism. This is also referred to as inductive bias. The pigeons in the experiment
are willing to adopt any explanation for the occurrence of food.
However, the rats “know” that food cannot cause an electric shock and that the
co-occurrence of noise with some food is not likely to affect the nutritional value of that
food. The rats' learning process is biased toward detecting some kinds of patterns
while ignoring other temporal correlations between events.
When Do We Use Machine Learning?
ML is used when:
• Human expertise does not exist (navigating on Mars)
• Humans can’t explain their expertise (speech recognition)
• Models must be customized (personalized medicine)
• Models are based on huge amounts of data (genomics)

Learning isn’t always useful:
• There is no need to “learn” to calculate payroll


When Do We Need Machine Learning?
Two aspects of a given problem may call for the use of programs that learn and
improve on the basis of their “experience”: the problem's complexity and the need for
adaptivity.
Tasks That Are Too Complex to Program.
Tasks Performed by Animals/Humans: There are numerous tasks that we human beings
perform routinely, yet our introspection concerning how we do them is not sufficiently
elaborate to extract a well defined program.
Examples of such tasks include driving, speech recognition, and image understanding.
In all of these tasks, state-of-the-art machine learning programs, programs that “learn
from their experience,” achieve quite satisfactory results once exposed to sufficiently
many training examples.
When Do We Need Machine Learning?...
Tasks beyond Human Capabilities: Analysis of very large and complex data sets:
astronomical data, turning medical archives into medical knowledge, weather
prediction, analysis of genomic data, Web search engines, and electronic
commerce.
With more and more available digitally recorded data, it becomes obvious that
there are treasures of meaningful information buried in data archives that are way
too large and too complex for humans to make sense of.
Learning to detect meaningful patterns in large and complex data sets is a
promising domain in which the combination of programs that learn with the almost
unlimited memory capacity and ever increasing processing speed of computers
opens up new horizons.
When Do We Need Machine Learning?...

Adaptivity. One limiting feature of programmed tools is their rigidity - once the
program has been written down and installed, it stays unchanged.
However, many tasks change over time or from one user to another.
Machine learning tools - programs whose behavior adapts to their input data - offer
a solution to such issues; they are, by nature, adaptive to changes in the
environment they interact with.
Typical successful applications of machine learning to such problems include
programs that decode handwritten text, where a fixed program can adapt to
variations between the handwriting of different users; spam detection programs,
adapting automatically to changes in the nature of spam e-mails; and speech
recognition programs.
When Do We Need Machine Learning?...
Machine learning is particularly useful in the following scenarios:
1. Large Amounts of Data: When there is a vast amount of data to analyze, machine learning can process
and draw insights from it more efficiently than manual methods.
2. Complex Patterns and Relationships: When patterns or relationships in the data are too complex for
traditional statistical methods to uncover, machine learning algorithms can be employed to detect these
hidden patterns.
3. Automation of Repetitive Tasks: For tasks that are repetitive and rule-based, machine learning can
automate processes, reducing the need for human intervention and minimizing errors.
4. Dynamic Environments: In environments where conditions change rapidly and systems need to adapt
in real-time (e.g., financial markets, autonomous vehicles), machine learning models can adjust and
learn from new data continuously.
5. Personalization and Recommendations: When there's a need to provide personalized experiences or
recommendations based on user behavior and preferences, such as in e-commerce, streaming services,
or online advertising.
6. Predictive Analytics: For making predictions based on historical data, such as forecasting sales,
predicting equipment failures, or estimating customer churn.
When Do We Need Machine Learning?...

• Image and Speech Recognition: When dealing with tasks that require recognizing patterns in
images, audio, or video, such as facial recognition, speech-to-text conversion, and object detection.
• Natural Language Processing (NLP): For understanding and generating human language, machine
learning is used in applications like chatbots, sentiment analysis, language translation, and text
summarization.
• Anomaly Detection: When identifying unusual patterns that might indicate fraud, security breaches,
or other significant deviations from the norm, such as in cybersecurity, financial monitoring, and quality
control.
• Optimizing Operations: In logistics, supply chain management, and resource allocation, machine
learning can help optimize operations by predicting demand, identifying inefficiencies, and
recommending improvements.
• Enhanced Customer Support: For automating and improving customer service through chatbots
and virtual assistants that can handle a large volume of inquiries and provide quick, accurate
responses.
• Medical Diagnosis and Treatment: In healthcare, machine learning can assist in diagnosing
diseases, predicting patient outcomes, and personalizing treatment plans based on patient data.
Every machine learning algorithm has three components:
1. Representation: what the model looks like; how knowledge is represented.
2. Evaluation: how good models are differentiated; how programs are evaluated.
Evaluation is done using an evaluation function
3. Optimization: the process for finding good models; how programs are generated
Representation
• Decision trees
• Sets of rules / Logic programs
• Instances
• Graphical models (Bayes/Markov nets)
• Neural networks
• Support vector machines
• Model ensembles
Etc.
Evaluation
• Accuracy
• Precision and recall
• Squared error
• Likelihood
• Posterior probability
• Cost / Utility
• Margin
• Entropy
• K-L divergence
Etc.
Optimization
• Combinatorial optimization
E.g.: Greedy search
• Convex optimization
E.g.: Gradient descent
• Constrained optimization
E.g.: Linear programming
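As a rough illustration of how the three components fit together, the sketch below picks one option from each list: a line as the representation, squared error as the evaluation function, and gradient descent as the optimizer. The toy data, learning rate, and step count are assumptions chosen for this example.

```python
# Sketch of the three components on a toy problem (all values are assumptions).
# Representation: a line  y = w * x + b
# Evaluation:     mean squared error
# Optimization:   gradient descent

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]          # generated from y = 2x + 1

def predict(w, b, x):
    return w * x + b

def mean_squared_error(w, b):
    return sum((predict(w, b, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def gradient_descent(steps=2000, learning_rate=0.05):
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Partial derivatives of the mean squared error with respect to w and b
        grad_w = sum(2 * (predict(w, b, x) - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (predict(w, b, x) - y) for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

w, b = gradient_descent()
print(w, b, mean_squared_error(w, b))   # w close to 2, b close to 1, error close to 0
```

Swapping any one component (for example, a decision tree as the representation or greedy search as the optimizer) gives a different learning algorithm.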
Types of Machine Learning

➢ Supervised learning
➢ Unsupervised learning
➢ Reinforcement learning
Supervised Learning
Supervised learning is a type of machine learning in which we provide sample
labelled data to the machine learning system in order to train it, and on that basis, it
predicts the output.
The system creates a model from the labelled data to learn how inputs relate to
outputs. Once training is done, we test the model on sample data to check whether
it predicts the correct output.
The goal of supervised learning is to map input data to output data. Supervised
learning is based on supervision, much as a student learns under the
supervision of a teacher.
An example of supervised learning is spam filtering.
Supervised learning can be further grouped into two categories of algorithms:
• Classification
• Regression
Unsupervised Learning
Unsupervised learning is a learning method in which a machine learns without any
supervision.
The machine is trained on a set of data that has not been labelled,
classified, or categorized, and the algorithm needs to act on that data without any
supervision.
The goal of unsupervised learning is to restructure the input data into new features or
groups of objects with similar patterns.
In unsupervised learning, we don't have a predetermined result. The machine tries to find
useful insights from large amounts of data.
It can be further classified into two categories of algorithms:
• Clustering
• Association
Reinforcement Learning
Reinforcement learning is a feedback-based learning method in which a learning
agent gets a reward for each right action and a penalty for each wrong action.
The agent learns automatically from this feedback and improves its performance.
In reinforcement learning, the agent interacts with the environment and explores it.
The goal of the agent is to collect the most reward points, and hence it improves its
performance.
A robotic dog that automatically learns how to move its limbs is an example
of reinforcement learning.
The three different types of machine learning
Supervised Learning
The main goal in supervised learning is to learn a model from labelled training data
that allows us to make predictions about unseen or future data.
Here, the term "supervised" refers to a set of training examples (data inputs) where
the desired output signals (labels) are already known.
The figure summarizes a typical supervised learning workflow, where the labelled
training data is passed to a machine learning algorithm for fitting a predictive model
that can make predictions on new, unlabelled data inputs:
In supervised learning, models are trained using a labelled dataset, from which the model
learns about each type of data. Once the training process is completed, the model is
tested on test data (a held-out part of the labelled dataset that was not used for
training), and then it predicts the output.
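Holding out test data can be sketched as below. The function name train_test_split, the 25% test fraction, and the toy dataset are assumptions for illustration; libraries such as scikit-learn offer a function for the same purpose.

```python
import random

def train_test_split(examples, test_fraction=0.25, seed=0):
    """Shuffle labelled examples and hold out a fraction for testing."""
    shuffled = examples[:]                 # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]   # (training set, test set)

# Hypothetical labelled data: (feature, label) pairs
data = [(x, "large" if x >= 10 else "small") for x in range(20)]
train, test = train_test_split(data)
print(len(train), len(test))               # 15 5
```

The model is fitted on the training set only, so the test set behaves like unseen data when the model is evaluated.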
Suppose we have a dataset of different types of shapes,
such as squares, rectangles, triangles, and other polygons
(for example, hexagons). The first step is to train the model
for each shape.

• If the given shape has four sides, and all the sides are
equal, then it will be labelled as a Square.
• If the given shape has three sides, then it will be labelled
as a triangle.
• If the given shape has six equal sides then it will be
labelled as hexagon.
Now, after training, we test our model using the test set,
and the task of the model is to identify the shape. The
machine is already trained on all types of shapes, and when
it finds a new shape, it classifies the shape on the basis of
number of sides, and predicts the output.
1. Regression

Regression algorithms are used when there is a relationship between the input
variable and the output variable and the output is a continuous value.
They are used for the prediction of continuous variables,
such as weather forecasting, market trends, etc.

2. Classification

Classification algorithms are used when the output variable is categorical,
which means there are two or more classes, such as Yes-No, Male-Female,
True-False, etc.
Considering the example of email
spam filtering, we can train a model
using a supervised machine learning
algorithm on a corpus of labelled
emails, which are correctly marked as
spam or non-spam, to predict whether
a new email belongs to either of the
two categories.

A supervised learning task with discrete class labels, such as in the
previous email spam filtering example, is also called a classification task.

Another subcategory of supervised learning is regression, where the
outcome signal is a continuous value.
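The contrast between the two subcategories can be sketched with one simple method used in both ways. The k-nearest-neighbours rule, the value k = 3, and both toy datasets are assumptions chosen for illustration.

```python
from collections import Counter

def nearest_neighbours(train, x, k=3):
    """Return the k training pairs (feature, target) whose feature is closest to x."""
    return sorted(train, key=lambda pair: abs(pair[0] - x))[:k]

def knn_classify(train, x, k=3):
    """Classification: predict a discrete label by majority vote."""
    votes = Counter(label for _, label in nearest_neighbours(train, x, k))
    return votes.most_common(1)[0][0]

def knn_regress(train, x, k=3):
    """Regression: predict a continuous value by averaging."""
    neighbours = nearest_neighbours(train, x, k)
    return sum(value for _, value in neighbours) / len(neighbours)

# Hypothetical data: number of suspicious words -> spam label
emails = [(0, "not spam"), (1, "not spam"), (2, "not spam"),
          (6, "spam"), (7, "spam"), (9, "spam")]
# Hypothetical data: floor area in square metres -> price in thousands
houses = [(50, 100.0), (60, 120.0), (70, 140.0), (100, 200.0)]

print(knn_classify(emails, 8))    # spam
print(knn_regress(houses, 65))    # 120.0
```

Both functions learn from labelled examples. The only difference is the type of the output: a class label in one case, a number in the other.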
Unsupervised Learning
Unsupervised learning is a machine learning technique in which models are not
supervised using a labelled training dataset.
Instead, the models themselves find hidden patterns and insights in the given data. It can
be compared to the learning that takes place in the human brain while learning new
things.
Unsupervised learning is a type of machine learning in which models are trained using
an unlabelled dataset and are allowed to act on that data without any supervision.
Example: Suppose the unsupervised learning algorithm is given an input dataset
containing images of different types of cats and dogs.
The algorithm is never given labels for this dataset, which means it has no prior
idea about the categories in it. The task of the unsupervised learning
algorithm is to identify the image features on its own.
The unsupervised learning algorithm will perform this task by clustering the image
dataset into groups according to the similarities between images.
Here, we have taken unlabelled input data, which means it is not categorized and the corresponding outputs
are not given. This unlabelled input data is fed to the machine learning model in order to train it.
First, the model interprets the raw data to find hidden patterns, and then a suitable
algorithm, such as k-means clustering, is applied.

Once the suitable algorithm is applied, it divides the data objects into groups according to the
similarities and differences between the objects.
Clustering: Clustering is a method of grouping objects
into clusters such that objects with the most similarities
remain in one group and have few or no similarities with
the objects of another group. Cluster analysis finds the
commonalities between the data objects and categorizes
them according to the presence or absence of those
commonalities.
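A minimal sketch of clustering with k-means is given below. The 2-D points, the choice of two clusters, and the starting centroids (passed in explicitly to keep the example deterministic, whereas they are usually chosen at random) are assumptions for illustration.

```python
def k_means(points, centroids, iterations=10):
    """Plain k-means: alternate between assigning points and moving centroids."""
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its assigned points
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return centroids, clusters

# Unlabelled data: no categories are given, only the points themselves
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.5), (8.5, 9.5)]
centroids, clusters = k_means(points, centroids=[points[0], points[1]])
print(clusters)
```

Although both starting centroids lie in the same region, the algorithm ends with the three points near (1, 1) in one group and the three points near (8, 8) in the other.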

Association: Association rule learning is an unsupervised
learning method used for finding relationships between
variables in a large database. It determines the sets of
items that occur together in the dataset. Association rules
make marketing strategies more effective. For example,
people who buy item X (say, bread) also tend to purchase
item Y (butter or jam). A typical example of association
rule learning is Market Basket Analysis.
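The strength of a rule such as "bread → butter" is commonly measured by its support and confidence. The sketch below computes both on a handful of made-up shopping baskets. The transactions and helper names are assumptions for illustration.

```python
# Hypothetical market baskets
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "jam"},
    {"bread", "milk"},
    {"milk", "eggs"},
    {"bread", "butter", "milk"},
]

def count(itemset):
    """Number of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions)

def support(itemset):
    """Fraction of all transactions that contain the itemset."""
    return count(itemset) / len(transactions)

def confidence(antecedent, consequent):
    """Among transactions with the antecedent, the fraction that also have the consequent."""
    return count(antecedent | consequent) / count(antecedent)

print(support({"bread", "butter"}))        # 0.6
print(confidence({"bread"}, {"butter"}))   # 0.75
```

Here bread and butter appear together in 3 of 5 baskets, and 3 of the 4 baskets containing bread also contain butter.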
Reinforcement Learning
Reinforcement Learning is a feedback-based Machine learning technique in which an
agent learns to behave in an environment by performing the actions and seeing the
results of actions.
For each good action, the agent gets positive feedback, and for each bad action, the
agent gets negative feedback or penalty.
In reinforcement learning, the agent learns automatically from feedback without
any labelled data, unlike in supervised learning.
Since there is no labelled data, the agent is bound to learn from its experience only.
Reinforcement Learning…
The agent interacts with the environment and explores it by itself.
The primary goal of an agent in reinforcement learning is to improve the
performance by getting the maximum positive rewards.
The agent learns by trial and error, and based on the experience, it
learns to perform the task in a better way.
Hence, we can say that "Reinforcement learning is a type of machine learning
method where an intelligent agent (computer program) interacts with the
environment and learns to act within that."
How a robotic dog learns to move its limbs is an example of
reinforcement learning.
Agent: The model that is being trained via reinforcement learning.
Environment: The training situation that the model must optimize for.
Action: Any of the possible steps that the model can take.
State: The current position or condition of the agent, as returned by the environment.
Reward: To help the model move in the right direction, it is given points (a reward)
that appraise an action.
Policy: The policy determines how an agent will behave at any time. It acts as a mapping
from the present state to an action.
Example: Suppose there is an AI agent in a maze
environment, and its goal is to find the diamond.
The agent interacts with the environment by
performing actions. Based on those actions, the
state of the agent changes, and it receives a
reward or penalty as feedback.
The agent keeps doing these three things
(take an action, change state or remain in the
same state, and get feedback), and in doing so
it learns and explores the environment.
The agent learns which actions lead to positive
feedback (rewards) and which actions lead to
negative feedback (penalties). As a reward, the
agent gets a positive point, and as a penalty,
it gets a negative point.
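The action, state, and feedback loop above can be sketched with tabular Q-learning, one common reinforcement learning algorithm. Everything specific here is an assumption for illustration: the "maze" is reduced to a 5-cell corridor with the diamond at one end and a penalty cell at the other, and the agent explores with purely random actions (which Q-learning permits, because it learns about the best action regardless of the action actually taken).

```python
import random

GOAL, PIT = 4, 0                 # terminal cells of a 5-cell corridor
ACTIONS = {"left": -1, "right": +1}

def step(state, action):
    """Environment: return (next_state, reward, done)."""
    nxt = state + ACTIONS[action]
    if nxt == GOAL:
        return nxt, 1.0, True    # found the diamond: positive point
    if nxt == PIT:
        return nxt, -1.0, True   # penalty: negative point
    return nxt, 0.0, False

def q_learning(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in (1, 2, 3) for a in ACTIONS}
    for _ in range(episodes):
        state, done = 2, False                      # every episode starts in the middle
        while not done:
            action = rng.choice(list(ACTIONS))      # explore at random
            nxt, reward, done = step(state, action)
            best_next = 0.0 if done else max(q[(nxt, a)] for a in ACTIONS)
            # Move the estimate toward: reward + discounted value of the next state
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = nxt
    return q

q = q_learning()
policy = {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in (1, 2, 3)}
print(policy)
```

After training, the learned policy picks the action with the highest estimated value in each cell, which here means heading toward the diamond.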
In reinforcement learning, the goal is to develop a system (agent) that improves its
performance based on interactions with the environment.
Since the information about the current state of the environment typically also
includes a so-called reward signal, we can think of reinforcement learning as a field
related to supervised learning.
In reinforcement learning, this feedback is not the correct ground truth label or
value, but a measure of how well the action was measured by a reward function.
Through its interaction with the environment, an agent can then use reinforcement
learning to learn a series of actions that maximizes this reward via an exploratory
trial-and-error approach or deliberative planning.
A popular example of reinforcement learning is a chess engine. Here, the agent
decides upon a series of moves depending on the state of the board (the
environment), and the reward can be defined as win or lose at the end of the game:

The agent in reinforcement learning tries to
maximize the reward through a series of
interactions with the environment.

Each state can be associated with a positive or
negative reward, and a reward can be defined as
accomplishing an overall goal, such as winning
or losing a game of chess.

For instance, in chess, the outcome of each
move can be thought of as a different state of
the environment.
To explore the chess example further, let's think of visiting certain configurations
on the chess board as being associated with states that will more likely lead to
winning—for instance, removing an opponent's chess piece from the board or
threatening the queen.
Other positions, however, are associated with states that will more likely result in
losing the game, such as losing a chess piece to the opponent in the following turn.
Now, in the game of chess, the reward (either positive for winning or negative for
losing the game) will not be given until the end of the game.
In addition, the final reward will also depend on how the opponent plays. For
example, the opponent may sacrifice the queen but eventually win the game.
Learning tasks
The field of machine learning has branched into several subfields dealing with
different types of learning tasks.
Supervised versus Unsupervised
Active versus Passive Learners
Helpfulness of the Teacher
Online versus Batch Learning Protocol
Supervised versus Unsupervised
Consider the task of learning to detect spam e-mail versus the task of anomaly
detection.
For the spam detection task, we consider a setting in which the learner receives
training e-mails for which the label spam/not-spam is provided.
On the basis of such training the learner should figure out a rule for labeling a newly
arriving e-mail message.
In contrast, for the task of anomaly detection, all the learner gets as training is a large
body of e-mail messages (with no labels) and the learner's task is to detect “unusual”
messages.
Supervised learning describes a scenario in which the “experience,” a training
example, contains significant information (say, the spam/not-spam labels) that is
missing in the unseen “test examples” to which the learned expertise is to be applied.
In this setting, the acquired expertise is aimed to predict that missing information for
the test data.
In such cases, we can think of the environment as a teacher that “supervises” the
learner by providing the extra information (labels).
In unsupervised learning, however, there is no distinction between training and test
data.
The learner processes input data with the goal of coming up with some summary, or
compressed version of that data.
Clustering a data set into subsets of similar objects is a typical example of such a task.
Active versus Passive Learners
An active learner interacts with the environment at training time, say, by posing
queries or performing experiments, while a passive learner only observes the
information provided by the environment (or the teacher) without influencing or
directing it.
The learner of a spam filter is usually passive, waiting for users to mark the e-mails
coming to them.
In an active setting, one could imagine asking users to label specific e-mails chosen by
the learner, or even composed by the learner, to enhance its understanding of what
spam is.
Helpfulness of the Teacher
When one thinks about human learning, of a baby at home or a student at school, the
process often involves a helpful teacher, who is trying to feed the learner with the
information most useful for achieving the learning goal.
In contrast, when a scientist learns about nature, the environment, playing the role of the
teacher, can be best thought of as passive: apples drop, stars shine, and the rain falls without
regard to the needs of the learner.
We model such learning scenarios by postulating that the training data (or the learner's
experience) is generated by some random process. This is the basic building block in the
branch of “statistical learning.”
Finally, learning also occurs when the learner's input is generated by an adversarial
“teacher.” This may be the case in the spam filtering example (if the spammer makes an effort
to mislead the spam filtering designer) or in learning to detect fraud.
One also uses an adversarial teacher model as a worst-case scenario, when no milder setup
can be safely assumed.
If you can learn against an adversarial teacher, you are guaranteed to succeed when
interacting with any odd teacher.
Online versus Batch Learning Protocol
The last parameter we mention is the distinction between situations in which the
learner has to respond online, throughout the learning process, and settings in which
the learner has to engage the acquired expertise only after having a chance to
process large amounts of data.
For example, a stockbroker has to make daily decisions, based on the experience
collected so far.
He may become an expert over time, but might have made costly mistakes in the
process.
In contrast, in many data mining settings, the learner - the data miner - has large
amounts of training data to play with before having to output conclusions.
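The two protocols can be contrasted on a deliberately simple stand-in for learning: estimating an average. In the batch version the estimate is produced once, after all data has been seen. In the online version an estimate must be available after every observation. The data values and function names are assumptions for illustration.

```python
def batch_mean(values):
    """Batch protocol: see all the data first, then output one estimate."""
    return sum(values) / len(values)

def online_means(stream):
    """Online protocol: update the estimate after every new observation."""
    mean, estimates = 0.0, []
    for n, value in enumerate(stream, start=1):
        mean += (value - mean) / n        # incremental update of the running mean
        estimates.append(mean)
    return estimates

daily_returns = [2.0, 4.0, 6.0, 8.0]      # hypothetical observations
print(online_means(daily_returns))        # [2.0, 3.0, 4.0, 5.0]
print(batch_mean(daily_returns))          # 5.0
```

The early online estimates are rough and improve as experience accumulates, which mirrors the stockbroker who may make costly mistakes before becoming an expert.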
Relations to Other Fields
As an interdisciplinary field, machine learning shares common threads with the
mathematical fields of statistics, information theory, game theory, and optimization.
It is naturally a subfield of computer science, as our goal is to program machines so
that they will learn.
In a sense, machine learning can be viewed as a branch of AI (Artificial Intelligence),
since, after all, the ability to turn experience into expertise or to detect meaningful
patterns in complex sensory data is a cornerstone of human (and animal)
intelligence.
In contrast with traditional AI, machine learning is not trying to build automated
imitation of intelligent behavior, but rather to use the strengths and special abilities
of computers to complement human intelligence, often performing tasks that fall
way beyond human capabilities.
For example, the ability to scan and process huge databases allows machine
learning programs to detect patterns that are outside the scope of human
perception.
The component of experience, or training, in machine learning often refers to data
that is randomly generated.
The task of the learner is to process such randomly generated examples toward
drawing conclusions that hold for the environment from which these examples are
picked.
This description of machine learning highlights its close relationship with statistics.
There are, however, a few significant differences of emphasis; if a doctor comes up
with the hypothesis that there is a correlation between smoking and heart disease, it
is the statistician's role to view samples of patients and check the validity of that
hypothesis (this is the common statistical task of hypothesis testing).
In contrast, machine learning aims to use the data gathered from samples of patients
to come up with a description of the causes of heart disease.
The hope is that automated techniques may be able to figure out meaningful patterns
(or hypotheses) that may have been missed by the human observer.
In contrast with traditional statistics, in machine learning in general, algorithmic
considerations play a major role.
Machine learning is about the execution of learning by computers; hence algorithmic
issues are pivotal.
We develop algorithms to perform the learning tasks and are concerned with their
computational efficiency.
Another difference is that while statistics is often interested in asymptotic behaviour
(like the convergence of sample-based statistical estimates as the sample sizes grow
to infinity), the theory of machine learning focuses on finite sample bounds.
Given the size of available samples, machine learning theory aims to figure out the
degree of accuracy that a learner can expect on the basis of such samples.
While in statistics it is common to work under the assumption of certain
prescribed data models (such as assuming the normality of data-generating
distributions, or the linearity of functional dependencies), in machine learning the
emphasis is on working under a "distribution-free" setting, where the learner
assumes as little as possible about the nature of the data distribution and allows
the learning algorithm to figure out which models best approximate the
data-generating process.
Basic Terminology & Framework
Relationships
Machine learning systems use relationships between inputs to produce predictions.

In algebra, a relationship is often written as y = ax + b:


• y is the label we want to predict
• a is the slope of the line
• x are the input values
• b is the intercept
With ML, a relationship is written as y = b + wx:
• y is the label we want to predict
• w is the weight (the slope)
• x are the features (input values)
• b is the intercept
Machine Learning Labels
In Machine Learning terminology, the label is the thing we want to predict.
It is like the y in a linear graph:

Algebra: y = ax + b
Machine Learning: y = b + wx

Machine Learning Features


In Machine Learning terminology, the features are the input.
They are like the x values in a linear graph:

Algebra: y = ax + b
Machine Learning: y = b + wx

Sometimes there can be many features (input values) with different weights:
y = b + w1x1 + w2x2 + w3x3 + w4x4
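The weighted sum above can be sketched in a few lines of Python. This is only an illustration: the function name predict and all numeric values are assumptions chosen for the example, not values from the slides.

```python
def predict(features, weights, bias):
    """Return y = b + w1*x1 + w2*x2 + ... for a single example."""
    if len(features) != len(weights):
        raise ValueError("need exactly one weight per feature")
    return bias + sum(w * x for w, x in zip(weights, features))


# Hypothetical values chosen only for illustration.
x = [1.0, 2.0, 3.0, 4.0]      # features x1..x4
w = [0.5, -1.0, 2.0, 0.25]    # weights w1..w4
b = 1.5                       # intercept

y = predict(x, w, b)
print(y)
```

With a single feature, the same function reduces to the line y = b + wx.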
Notation and Conventions
The table depicts an excerpt of the
Iris dataset, which is a classic example
in the field of machine learning.
The Iris dataset contains the
measurements of 150 Iris flowers
from three different species—Setosa,
Versicolor, and Virginica.
Here, each flower example represents
one row in our dataset, and the
flower measurements in centimeters
are stored as columns, which we also
call the features of the dataset:
To keep the notation and implementation simple yet efficient, we will make use of some
of the basics of linear algebra.
We will follow the common convention to represent each example as a separate row in
a feature matrix, X, where each feature is stored as a separate column.
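The Iris dataset, consisting of 150 examples and four features, can then be written as
a 150 × 4 matrix, $\boldsymbol{X} \in \mathbb{R}^{150 \times 4}$:

$$
\boldsymbol{X} =
\begin{bmatrix}
x_1^{(1)} & x_2^{(1)} & x_3^{(1)} & x_4^{(1)} \\
x_1^{(2)} & x_2^{(2)} & x_3^{(2)} & x_4^{(2)} \\
\vdots & \vdots & \vdots & \vdots \\
x_1^{(150)} & x_2^{(150)} & x_3^{(150)} & x_4^{(150)}
\end{bmatrix}
$$

Here, the superscript i refers to the ith training example, and the subscript j refers to the jth
dimension of the training dataset.
We will use lowercase, bold-face letters to refer to vectors ($\boldsymbol{x} \in \mathbb{R}^{n \times 1}$) and
uppercase, bold-face letters to refer to matrices ($\boldsymbol{X} \in \mathbb{R}^{n \times m}$).
To refer to single elements in a vector or matrix, we will write the letters in italics ($x^{(n)}$ or
$x_m^{(n)}$, respectively).
For example, $x_1^{(150)}$ refers to the first dimension of flower example 150, the sepal
length.
Thus, each row in this feature matrix represents one flower instance and can be
written as a four-dimensional row vector, $\boldsymbol{x}^{(i)} \in \mathbb{R}^{1 \times 4}$:

$$
\boldsymbol{x}^{(i)} = \begin{bmatrix} x_1^{(i)} & x_2^{(i)} & x_3^{(i)} & x_4^{(i)} \end{bmatrix}
$$

And each feature dimension is a 150-dimensional column vector, $\boldsymbol{x}_j \in \mathbb{R}^{150 \times 1}$. For
example:

$$
\boldsymbol{x}_j = \begin{bmatrix} x_j^{(1)} \\ x_j^{(2)} \\ \vdots \\ x_j^{(150)} \end{bmatrix}
$$

Similarly, we will store the target variables (here, class labels) as a 150-dimensional
column vector:

$$
\boldsymbol{y} = \begin{bmatrix} y^{(1)} \\ \vdots \\ y^{(150)} \end{bmatrix}, \quad y^{(i)} \in \{\text{Setosa}, \text{Versicolor}, \text{Virginica}\}
$$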
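The notation maps directly onto NumPy indexing. The sketch below is an illustration only: it assumes NumPy is installed and uses just the first three rows of the Iris dataset instead of all 150. The variable names are chosen for this example.

```python
import numpy as np

# First three rows of the Iris dataset (sepal length, sepal width,
# petal length, petal width, all in cm); the full matrix has 150 rows.
X = np.array([
    [5.1, 3.5, 1.4, 0.2],
    [4.9, 3.0, 1.4, 0.2],
    [4.7, 3.2, 1.3, 0.2],
])
y = np.array(["Setosa", "Setosa", "Setosa"])  # class labels, one per row

print(X.shape)     # (number of examples, number of features)
x_2 = X[1]         # row vector x^(2): all features of the second example
col_1 = X[:, 0]    # column vector x_1: sepal length of every example
x_1_2 = X[1, 0]    # single element x_1^(2); NumPy indices are zero-based
```

Note the off-by-one shift: the mathematical notation counts from 1, while NumPy counts from 0.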
Machine Learning Terminology
Machine learning is a vast field and also very interdisciplinary as it brings together
many scientists from other areas of research. As it happens, many terms and
concepts have been rediscovered or redefined and may already be familiar to you
but appear under different names.
Training example: A row in a table representing the dataset and synonymous with
an observation, record, instance, or sample (in most contexts, sample refers to a
collection of training examples).
Training: Model fitting, for parametric models similar to parameter estimation.
Feature, abbrev. x: A column in a data table or data (design) matrix.
Synonymous with predictor, variable, input, attribute, or covariate.
Target, abbrev. y: Synonymous with outcome, output, response variable, dependent
variable, (class) label, and ground truth.
Loss function: Often used synonymously with a cost function. Sometimes the loss
function is also called an error function.

In some literature, the term "loss" refers to the loss measured for a single data point,
and the cost is a measurement that computes the loss (average or summed) over the
entire dataset.
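The distinction between loss and cost can be sketched as follows. Squared error is used here only as one common choice of loss, and the targets and predictions are made-up numbers.

```python
def squared_error(y_true, y_pred):
    """Loss: the error measured for a single data point."""
    return (y_true - y_pred) ** 2


def mean_squared_error(targets, predictions):
    """Cost: the per-example loss averaged over the whole dataset."""
    losses = [squared_error(t, p) for t, p in zip(targets, predictions)]
    return sum(losses) / len(losses)


# Hypothetical targets and predictions, for illustration only.
targets = [3.0, 5.0, 2.0]
predictions = [2.0, 5.0, 4.0]

cost = mean_squared_error(targets, predictions)
print(cost)
```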
A roadmap for building machine learning systems
The diagram shows a typical workflow for using machine learning in predictive modeling:
Preprocessing
Raw data rarely comes in the form and shape that is necessary for the optimal
performance of a learning algorithm. Thus, the preprocessing of the data is one of the
most crucial steps in any machine learning application.
If we take the Iris flower dataset as an example, we can think of the raw data as a
series of flower images from which we want to extract meaningful features.
Useful features could be the color, hue, and intensity of the flowers, or the height,
length, and width of the flowers.
Many machine learning algorithms also require that the selected features are on the
same scale for optimal performance, which is often achieved by transforming the
features in the range [0, 1] or a standard normal distribution with zero mean and unit
variance.
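Both scaling approaches can be sketched with the standard library alone. The function names and the sample column are assumptions for this illustration; in practice a library scaler would normally be used.

```python
from statistics import mean, pstdev


def min_max_scale(values):
    """Rescale linearly so the smallest value maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Center the values at zero mean and scale them to unit variance."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


ages = [30.0, 35.0, 40.0, 45.0, 50.0]   # hypothetical feature column
print(min_max_scale(ages))
print(standardize(ages))
```

Both functions divide by zero on a constant column, so such a feature would need special handling (or could simply be dropped, since it carries no information).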
Preprocessing…
Some of the selected features may be highly correlated and therefore redundant to a
certain degree.
In those cases, dimensionality reduction techniques are useful for compressing the
features onto a lower dimensional subspace.
Reducing the dimensionality of our feature space has the advantage that less storage
space is required, and the learning algorithm can run much faster.
In certain cases, dimensionality reduction can also improve the predictive
performance of a model if the dataset contains a large number of irrelevant features
(or noise); that is, if the dataset has a low signal-to-noise ratio.
To determine whether our machine learning algorithm not only performs well on
the training dataset but also generalizes well to new data, we also want to
randomly divide the dataset into a separate training and test dataset.
We use the training dataset to train and optimize our machine learning model,
while we keep the test dataset until the very end to evaluate the final model.
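A random split can be sketched by shuffling row indices. The helper name, the 30% test fraction, and the seed below are assumptions for illustration; scikit-learn's train_test_split function offers the same idea with more options.

```python
import random


def train_test_split_indices(n_examples, test_fraction=0.3, seed=1):
    """Shuffle the row indices and split them into train and test parts."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)   # fixed seed for reproducibility
    n_test = int(round(n_examples * test_fraction))
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = train_test_split_indices(150)
print(len(train_idx), len(test_idx))
```

The two index lists never overlap, so no test example can leak into training.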
Training and selecting a predictive model
Each classification algorithm has its inherent biases, and no single classification model
enjoys superiority if we don't make any assumptions about the task.
It is essential to compare at least a handful of different algorithms in order to train
and select the best performing model.
But before we can compare different models, we first have to decide upon a metric to
measure performance.
One commonly used metric is classification accuracy, which is defined as the
proportion of correctly classified instances.
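As a small illustration of this metric, the sketch below counts matching labels. The label lists are invented for the example.

```python
def accuracy(y_true, y_pred):
    """Proportion of predictions that match the true class labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical true labels and predictions.
y_true = ["Yes", "No", "No", "Yes", "No"]
y_pred = ["Yes", "No", "Yes", "Yes", "No"]

print(accuracy(y_true, y_pred))   # 4 of the 5 labels match
```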
How do we know which model performs well on the final test dataset and real-world
data if we don't use this test dataset for the model selection, but keep it for the final
model evaluation?
In order to address the issue embedded in this question, different techniques
summarized as "cross-validation" can be used.
In cross-validation, we further divide a dataset into training and validation subsets in
order to estimate the generalization performance of the model.
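One common variant, k-fold cross-validation, can be sketched as below. The generator name is an assumption, and the sketch does not shuffle the data first, which would usually be done in practice.

```python
def k_fold_indices(n_examples, k):
    """Yield (training_indices, validation_indices) for each of k folds."""
    indices = list(range(n_examples))
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_examples // k + (1 if i < n_examples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        training = indices[:start] + indices[start + size:]
        yield training, validation
        start += size


for train_idx, val_idx in k_fold_indices(10, k=3):
    print(len(train_idx), val_idx)
```

Each example is used for validation exactly once, and the k validation scores are typically averaged to estimate generalization performance.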
Finally, we also cannot expect that the default parameters of the different learning
algorithms provided by software libraries are optimal for our specific problem task.
Therefore, we will make frequent use of hyperparameter optimization techniques that
help us to fine-tune the performance of our model.
We can think of those hyperparameters as parameters that are not learned from the
data but represent the knobs of a model that we can turn to improve its performance.
Evaluating models and predicting unseen data instances
After we have selected a model that has been fitted on the training dataset, we can
use the test dataset to estimate how well it performs on this unseen data to estimate
the so-called generalization error.
If we are satisfied with its performance, we can now use this model to predict new,
future data.
It is important to note that the parameters for feature scaling, dimensionality
reduction etc, are solely obtained from the training dataset, and the same parameters
are later reapplied to transform the test dataset, as well as any new data instances—
the performance measured on the test data may be overly optimistic otherwise.
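The rule can be sketched with a tiny hand-written scaler. The class name Standardizer and the numbers are assumptions for illustration; scikit-learn transformers follow the same fit/transform pattern.

```python
from statistics import mean, pstdev


class Standardizer:
    """Tiny stand-in for a scaler: learns mean and std, then reuses them."""

    def fit(self, values):
        self.mean_ = mean(values)
        self.std_ = pstdev(values)
        return self

    def transform(self, values):
        return [(v - self.mean_) / self.std_ for v in values]


train = [10.0, 20.0, 30.0]     # hypothetical training feature
test = [20.0, 40.0]            # hypothetical test feature

scaler = Standardizer().fit(train)      # parameters come from training data only
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)    # same parameters reused, no refitting
print(test_scaled)
```

Calling fit on the test data instead would let information about the test set influence the preprocessing.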
Using Python for Machine Learning
Python is one of the most popular programming languages for data science and a
large number of useful libraries for scientific computing and machine learning have
been developed.
Although the performance of interpreted languages, such as Python, for
computation-intensive tasks is inferior to lower-level programming languages,
extension libraries such as NumPy and SciPy have been developed that build upon
lower-layer Fortran and C implementations for fast vectorized operations on
multidimensional arrays.
For machine learning programming tasks, we will mostly refer to the scikit-learn
library, which is currently one of the most popular and accessible open source
machine learning libraries.
For the subfield of machine learning called deep learning, we will use the latest version of
the TensorFlow library, which specializes in training so-called deep neural network
models very efficiently by utilizing graphics cards.
Installing Python and packages from the Python
Package Index
Python is available for all three major operating systems (Microsoft Windows, macOS, and
Linux), and the installer, as well as the documentation, can be downloaded from the official
Python website: https://www.python.org.
We strongly advise that you use Python 3.7 or newer.
The additional packages can be installed via the pip installer program, which has been
bundled with Python since version 3.4.
More information about pip can be found at https://docs.python.org/3/installing/index.html.
After we have successfully installed Python, we can execute pip from the terminal to install
additional Python packages:
pip install SomePackage
Already installed packages can be updated via the --upgrade flag:
pip install SomePackage --upgrade
Using the Anaconda Python distribution and
package manager
A highly recommended alternative Python distribution for scientific computing is Anaconda
by Continuum Analytics.
Anaconda is a free—including commercial use—enterprise-ready Python distribution that
bundles all the essential Python packages for data science, math, and engineering into one
user-friendly, cross-platform distribution.
The Anaconda installer can be downloaded at https://docs.anaconda.com/anaconda/install/,
and an Anaconda quick start guide is available at
https://docs.anaconda.com/anaconda/user-guide/getting-started/.
After successfully installing Anaconda, we can install new Python packages using
the following command:
conda install SomePackage
Existing packages can be updated using the following command:
conda update SomePackage
Packages for scientific computing, data science, and
machine learning
We will mainly use NumPy's multidimensional arrays to store and manipulate data.
Occasionally, we will make use of pandas, which is a library built on top of NumPy
that provides additional higher-level data manipulation tools that make working
with tabular data even more convenient. To augment the learning experience and
visualize quantitative data, we will use the very customizable Matplotlib library.
Please make sure that the version numbers of your installed packages are equal to,
or greater than, the version numbers given below to ensure that the code examples
run correctly:
NumPy 1.17.4
SciPy 1.3.1
scikit-learn 0.22.0
Matplotlib 3.1.0
pandas 0.25.3
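One way to look up what is installed is sketched below. It assumes Python 3.8 or newer, where importlib.metadata is part of the standard library. The helper name installed_version is hypothetical, and the script only prints versions side by side; it does not compare them.

```python
from importlib.metadata import PackageNotFoundError, version

# Minimum versions copied from the list above.
MINIMUM_VERSIONS = {
    "numpy": "1.17.4",
    "scipy": "1.3.1",
    "scikit-learn": "0.22.0",
    "matplotlib": "3.1.0",
    "pandas": "0.25.3",
}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


for name, minimum in MINIMUM_VERSIONS.items():
    print(f"{name}: installed={installed_version(name)}, required>={minimum}")
```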
Data Preprocessing
Data preprocessing is the process of preparing raw data and making it suitable for a
machine learning model. It is the first and a crucial step in creating a machine
learning model.
The quality of the data and the amount of useful information that it contains are key
factors that determine how well a machine learning algorithm can learn.
Real-world data generally contains noise and missing values, and may be in a format
that cannot be used directly by machine learning models.
Data preprocessing cleans the data and makes it suitable for a machine learning
model, which can also improve the accuracy and efficiency of the model.
It is absolutely critical to ensure that we examine and preprocess a dataset before we
feed it to a learning algorithm.
Data Preprocessing…
It involves the following steps:
➢ Getting the dataset
➢ Importing libraries
➢ Importing datasets
➢ Finding Missing Data
➢ Encoding Categorical Data
➢ Splitting dataset into training and test set
➢ Feature scaling
1) Get the Dataset
To create a machine learning model, the first thing we require is a dataset, because a
machine learning model works entirely on data.
Data collected for a particular problem and stored in a proper format is known as a
dataset.
Datasets come in different formats for different purposes: for example, a dataset for a
business problem will differ from a dataset of liver patient records. Each dataset is
therefore different from every other dataset.
To use the dataset in our code, we usually put it into a CSV file. Sometimes, however,
we may also need to use an HTML or xlsx file.
1) Get the Dataset…
What is a CSV File?
CSV stands for "Comma-Separated Values"; it is a file format that allows us
to save tabular data, such as spreadsheets. It is useful for huge datasets, and
programs can read such files easily.
Here we will use a demo dataset for data preprocessing; for practice, it can be
downloaded from https://www.superdatascience.com/pages/machine-learning.
For real-world problems, we can download datasets online from various sources
such as https://www.kaggle.com/uciml/datasets,
https://archive.ics.uci.edu/ml/index.php, etc.

We can also create our own dataset by gathering data through various APIs with Python
and putting that data into a .csv file.
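Writing collected records to a CSV file and reading them back can be sketched with Python's built-in csv module. The three rows are taken from the demo dataset used in these slides, the temporary directory is an assumption so that the example leaves no file behind, and the data-gathering step itself is not shown.

```python
import csv
import os
import tempfile

# A few rows in the style of the demo dataset; None marks a missing value.
rows = [
    {"Country": "India", "Age": 38, "Salary": 68000, "Purchased": "No"},
    {"Country": "France", "Age": 43, "Salary": 45000, "Purchased": "Yes"},
    {"Country": "Germany", "Age": 40, "Salary": None, "Purchased": "Yes"},
]

path = os.path.join(tempfile.mkdtemp(), "Dataset.csv")

with open(path, "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["Country", "Age", "Salary", "Purchased"])
    writer.writeheader()
    writer.writerows(rows)   # None is written as an empty field

with open(path, newline="") as f:
    loaded = list(csv.DictReader(f))

print(loaded[2])   # every value comes back as a string; the missing one is ''
```

The empty field is what a reader such as pandas later reports as a missing value.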
2) Importing Libraries
In order to perform data preprocessing using Python, we need to import some
predefined Python libraries. These libraries are used to perform some specific jobs.
There are three specific libraries that we will use for data preprocessing, which are:
Numpy: The NumPy library is used for performing mathematical operations in the
code.
It is the fundamental package for scientific computing in Python. It also supports
large, multidimensional arrays and matrices.
So, in Python, we can import it as:
import numpy as nm
Here we have used nm as a short name for NumPy, and it will be used throughout the
program.
2) Importing Libraries…

Matplotlib: The second library is matplotlib, which is a Python 2D plotting library, and
with this library, we need to import a sub-library pyplot.
This library is used to plot any type of charts in Python for the code.
It will be imported as below:
import matplotlib.pyplot as mpt
Here we have used mpt as a short name for this library.
2) Importing Libraries…
Pandas: The last library is pandas, one of the most widely used Python libraries for
importing and managing datasets. It is an open-source data manipulation and analysis
library. It is imported as below; here, pd is used as a short name for the library.
Example:
import pandas as pd

mydataset = {
    'cars': ["BMW", "Volvo", "Ford"],
    'passings': [3, 7, 2]
}

myvar = pd.DataFrame(mydataset)

print(myvar)
3) Importing the Datasets
Now we need to import the datasets which we have collected for our machine
learning project.
But before importing a dataset, we need to set the current directory as a working
directory.
To set a working directory in the Spyder IDE, we need to follow the steps below:
1. Save your Python file in the directory which contains the dataset.
2. Go to the File explorer option in the Spyder IDE, and select the required directory.
3. Press the F5 key or click the Run option to execute the file.
Here, in the below image, we can see the Python file along with required dataset.
Now, the current folder is set as a working directory.
3) Importing the Datasets…
Pandas is a Python library used for working with data sets.
It has functions for analyzing, cleaning, exploring, and manipulating data.
Pandas can clean messy data sets, and make them readable and relevant.
read_csv() function:
To import the dataset, we will use the read_csv() function of the pandas library, which reads a
CSV file into a DataFrame on which we can then perform various operations.
Using this function, we can read a CSV file locally as well as from a URL.
We can use the read_csv function as below:
data_set= pd.read_csv('Dataset.csv')
Here, data_set is the name of the variable that stores our dataset, and inside the function, we have
passed the name of our dataset file.
Once we execute the above line of code, the dataset is imported into our code.
We can also check the imported dataset by clicking on the section variable explorer,
and then double click on data_set. Consider the below image:

As in the above image, indexing is started from 0, which is the default indexing in Python. We can also
change the format of our dataset by clicking on the format option.
Extracting dependent and independent variables:
In machine learning, it is important to separate the matrix of features
(independent variables) from the dependent variable in the dataset. In our dataset, there
are three independent variables, Country, Age, and Salary, and one dependent
variable, Purchased.
Extracting independent variable:
To extract an independent variable, we will use iloc[ ] method of Pandas library. It is
used to extract the required rows and columns from the dataset.
x= data_set.iloc[:,:-1].values
In the above code, the first colon(:) is used to take all the rows, and the second
colon(:) is for all the columns. Here we have used :-1, because we don't want to take
the last column as it contains the dependent variable. So by doing this, we will get
the matrix of features.
By executing the above code, we will get output as
[['India' 38.0 68000.0]
['France' 43.0 45000.0]
['Germany' 30.0 54000.0]
['France' 48.0 65000.0]
['Germany' 40.0 nan]
['India' 35.0 58000.0]
['Germany' nan 53000.0]
['France' 49.0 79000.0]
['India' 50.0 88000.0]
['France' 37.0 77000.0]]

As we can see in the above output, there are only three variables.
Extracting dependent variable:
To extract dependent variables, again, we will use Pandas .iloc[] method.
y= data_set.iloc[:,3].values
Here we have taken all the rows with the last column only. It will give the array of
dependent variables.
By executing the above code, we will get output as:
array(['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'],
dtype=object)

Note: With scikit-learn in Python, the features and the target are passed to a model as
separate arrays, so this extraction step is needed; R's formula interface typically takes
the whole data frame instead.
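The two extraction steps can be tried without the CSV file by rebuilding the demo dataset inline from the outputs shown above. This is a sketch assuming pandas and NumPy are installed. It selects the last column with -1 instead of the hard-coded 3, which is a small change to the slide code so that it keeps working if columns are added.

```python
import numpy as np
import pandas as pd

# The demo dataset rebuilt inline from the outputs shown above.
data_set = pd.DataFrame({
    "Country": ["India", "France", "Germany", "France", "Germany",
                "India", "Germany", "France", "India", "France"],
    "Age": [38, 43, 30, 48, 40, 35, np.nan, 49, 50, 37],
    "Salary": [68000, 45000, 54000, 65000, np.nan,
               58000, 53000, 79000, 88000, 77000],
    "Purchased": ["No", "Yes", "No", "No", "Yes",
                  "Yes", "No", "Yes", "No", "Yes"],
})

x = data_set.iloc[:, :-1].values   # all rows, every column except the last
y = data_set.iloc[:, -1].values    # all rows, last column only

print(x.shape, y.shape)
```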
Dealing with missing data
It is not uncommon in real-world applications for our training examples to be
missing one or more values for various reasons.
There could have been an error in the data collection process, certain
measurements may not be applicable, or particular fields could have been simply
left blank in a survey, for example.
We typically see missing values as blank spaces in our data table or as placeholder
strings such as NaN, which stands for "not a number," or NULL (a commonly used
indicator of unknown values in relational databases)
Most computational tools are unable to handle such missing values or will produce
unpredictable results if we simply ignore them.
Therefore, it is crucial that we take care of those missing values before we proceed
with further analyses.
4) Handling Missing data:
The next step of data preprocessing is to handle missing data in the datasets.
If our dataset contains some missing data, then it may create a huge problem for our
machine learning model.
Ways to handle missing data:
There are mainly two ways to handle missing data, which are:
1. By deleting the particular row:
The first way is commonly used to deal with null values.
Here, we simply delete the specific row or column that contains null values.
However, this approach is not very effective, because removing data may lead to a loss of
information and therefore to less accurate results.
2. By calculating the mean:
Here, we calculate the mean of the column or row that contains a missing value and put it
in place of the missing value.
This strategy is useful for features that have numeric data, such as age, salary, year, etc.
Eliminating training examples or features with missing values
One of the easiest ways to deal with missing data is simply to remove the corresponding
features (columns) or training examples (rows) from the dataset entirely.
Rows with missing values can easily be dropped via the dropna method:
>>> df.dropna(axis=0)
     A    B    C    D
0  1.0  2.0  3.0  4.0
Similarly, we can drop columns that have at least one NaN in any row by setting the axis
argument to 1:
>>> df.dropna(axis=1)
      A     B
0   1.0   2.0
1   5.0   6.0
2  10.0  11.0
Pandas DataFrame.dropna() Syntax
Syntax: DataFrameName.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
Parameter (Value): Description
• axis (0, 1, 'index', 'columns'): Optional, default 0. 0 and 'index' remove ROWS that contain NULL values; 1 and 'columns' remove COLUMNS that contain NULL values.
• how ('all', 'any'): Optional, default 'any'. Specifies whether to remove the row or column when ALL values are NULL, or if ANY value is NULL.
• thresh (Number): Optional. Specifies the number of NOT NULL values required to keep the row.
• subset (List): Optional. Specifies where to look for NULL values.
• inplace (True, False): Optional, default False. If True: the removing is done on the current DataFrame. If False: returns a copy where the removing is done.
Example:
import pandas as pd
df = pd.read_csv('data.csv')
newdf = df.dropna()
The dropna method supports several additional parameters that can come in handy:
# only drop rows where all columns are NaN
# (returns the whole array here since we don't have a row with all values NaN)
>>> df.dropna(how='all')
      A     B     C    D
0   1.0   2.0   3.0  4.0
1   5.0   6.0   NaN  8.0
2  10.0  11.0  12.0  NaN
# drop rows that have fewer than 4 real values
>>> df.dropna(thresh=4)
     A    B    C    D
0  1.0  2.0  3.0  4.0
# only drop rows where NaN appear in specific columns (here: 'C')
>>> df.dropna(subset=['C'])
      A     B     C    D
0   1.0   2.0   3.0  4.0
2  10.0  11.0  12.0  NaN
Although the removal of missing data seems to be a convenient approach, it also comes
with certain disadvantages; for example, we may end up removing too many samples,
which will make a reliable analysis impossible.
Or, if we remove too many feature columns, we will run the risk of losing valuable
information that our classifier needs to discriminate between classes.
Imputing missing values
Often, the removal of training examples or dropping of entire feature columns is
simply not feasible, because we might lose too much valuable data.
In this case, we can use different interpolation techniques to estimate the missing
values from the other training examples in our dataset.
One of the most common interpolation techniques is mean imputation, where we
simply replace the missing value with the mean value of the entire feature column.
A convenient way to achieve this is by using the SimpleImputer class from scikit-learn
>>> from sklearn.impute import SimpleImputer
>>> import numpy as np
>>> imr = SimpleImputer(missing_values=np.nan, strategy='mean')
>>> imr = imr.fit(df.values)
>>> imputed_data = imr.transform(df.values)
>>> imputed_data
array([[ 1., 2., 3., 4.],
       [ 5., 6., 7.5, 8.],
       [ 10., 11., 12., 6.]])
Here, we replaced each NaN value with the corresponding mean, which is separately calculated for
each feature column.
Other options for the strategy parameter are median or most_frequent, where the latter
replaces the missing values with the most frequent values.
This is useful for imputing categorical feature values, for example, a feature column that stores an
encoding of color names, such as red, green, and blue.
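Most-frequent imputation of a categorical column can be sketched as follows. This example uses pandas' mode and fillna methods in place of SimpleImputer to keep it short, and the color column is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column with one missing entry.
colors = pd.Series(["red", "green", "red", np.nan, "blue"])

most_frequent = colors.mode()[0]          # most common non-missing value
imputed = colors.fillna(most_frequent)    # returns a new Series

print(imputed.tolist())
```

If several values tie for most frequent, mode() returns all of them in sorted order, so index 0 picks the first one.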
Alternatively, an even more convenient way to impute missing values is by using pandas' fillna
method and providing an imputation method as an argument.
For example, using pandas, we could achieve the same mean imputation directly in the DataFrame
object via the following command:
>>> df.fillna(df.mean())
The fillna() method replaces the NULL values with a specified value.
Syntax
dataframe.fillna(value, method, axis, inplace, limit, downcast)
Parameter (Value): Description
• value (Number, String, Dictionary, Series, DataFrame): Required. Specifies the value to replace the NULL values with. This can also be values for the entire row or column.
• method ('backfill', 'bfill', 'pad', 'ffill', None): Optional, default None. Specifies the method to use when replacing.
• axis (0, 1, 'index', 'columns'): Optional, default 0. The axis to fill the NULL values along.
• inplace (True, False): Optional, default False. If True: the replacing is done on the current DataFrame. If False: returns a copy where the replacing is done.
• limit (Number, None): Optional, default None. Specifies the maximum number of NULL values to fill (if method is specified).
• downcast (Dictionary, None): Optional. A dictionary of values to fill for specific data types.
Example
Replace NULL values with the number 222222:
import pandas as pd
df = pd.read_csv('data.csv')
newdf = df.fillna(222222)
The SimpleImputer class belongs to the so-called transformer classes in scikit-learn, which are used for
data transformation.
The two essential methods of those estimators are fit and transform.
The fit method is used to learn the parameters from the training data, and the transform method uses
those parameters to transform the data. Any data array that is to be transformed needs to have the
same number of features as the data array that was used to fit the model.
Figure illustrates how a transformer, fitted on the training data, is used to transform a training dataset as well as a new test
dataset:
To handle missing values, we will use the scikit-learn library in our code, which contains
various modules for building machine learning models. Here we will
use the SimpleImputer class of the sklearn.impute module (the older Imputer class of
sklearn.preprocessing was removed in scikit-learn 0.22).
#handling missing data (Replacing missing data with the mean value)
import numpy as np
from sklearn.impute import SimpleImputer
imputer= SimpleImputer(missing_values=np.nan, strategy='mean')
#Fitting imputer object to the independent variables x.
imputer= imputer.fit(x[:, 1:3])
#Replacing missing data with the calculated mean value
x[:, 1:3]= imputer.transform(x[:, 1:3])
Output:
array([['India', 38.0, 68000.0],
       ['France', 43.0, 45000.0],
       ['Germany', 30.0, 54000.0],
       ['France', 48.0, 65000.0],
       ['Germany', 40.0, 65222.22222222222],
       ['India', 35.0, 58000.0],
       ['Germany', 41.111111111111114, 53000.0],
       ['France', 49.0, 79000.0],
       ['India', 50.0, 88000.0],
       ['France', 37.0, 77000.0]], dtype=object)
In the output, the missing values have been replaced with the mean of the remaining
values in each column.
Identifying missing values in tabular data
A simple example data frame from a comma-separated values (CSV) file:
>>> import pandas as pd
>>> from io import StringIO
>>> csv_data = \
... '''A,B,C,D
... 1.0,2.0,3.0,4.0
... 5.0,6.0,,8.0
... 10.0,11.0,12.0,'''
>>> # If you are using Python 2.7, you need
>>> # to convert the string to unicode:
>>> # csv_data = unicode(csv_data)
>>> df = pd.read_csv(StringIO(csv_data))
>>> df
      A     B     C    D
0   1.0   2.0   3.0  4.0
1   5.0   6.0   NaN  8.0
2  10.0  11.0  12.0  NaN
Using the preceding code, we read CSV-formatted data into a pandas DataFrame via the
read_csv function and noticed that the two missing cells were replaced by NaN.
The StringIO function in the preceding code example was simply used for the purposes of
illustration. It allowed us to read the string assigned to csv_data into a pandas DataFrame
as if it was a regular CSV file on our hard drive.
For a larger DataFrame, it can be tedious to look for missing values manually; in this case,
we can use the isnull method to return a DataFrame with Boolean values that indicate
whether a cell contains a numeric value (False) or if data is missing (True).
Using the sum method, we can then return the number of missing values per column as
follows:
>>> df.isnull().sum()
A    0
B    0
C    1
D    1
dtype: int64
This way, we can count the number of missing values per column.
Convenient data handling with pandas' data frames
Although scikit-learn was originally developed for working with NumPy arrays only, it
can sometimes be more convenient to preprocess data using pandas' DataFrame.
Most scikit-learn functions support DataFrame objects as inputs, but since NumPy
array handling is more mature in the scikit-learn API, it is recommended to use
NumPy arrays when possible.
You can always access the underlying NumPy array of a DataFrame via the values
attribute before you feed it into a scikit-learn estimator:
>>> df.values
array([[ 1., 2., 3., 4.],
[ 5., 6., nan, 8.],
[ 10., 11., 12., nan]])
Handling categorical data
So far, we have only been working with numerical values. However, it is not
uncommon for real-world datasets to contain one or more categorical feature
columns.
When we are talking about categorical data, we have to further distinguish between
ordinal and nominal features.
Ordinal features can be understood as categorical values that can be sorted or
ordered.
For example, t-shirt size would be an ordinal feature, because we can define an order:
XL > L > M.
In contrast, nominal features don't imply any order and, to continue with the previous
example, we could think of t-shirt color as a nominal feature since it typically doesn't
make sense to say that, for example, red is larger than blue.
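Because an ordinal feature has a meaningful order, it can be encoded with integers that preserve that order. The sketch below assumes the ordering XL > L > M from the example. The specific integer codes and the sample column are illustrative choices.

```python
# Assumed ordering XL > L > M, encoded as integers that preserve it.
size_mapping = {"M": 1, "L": 2, "XL": 3}
inverse_size_mapping = {code: size for size, code in size_mapping.items()}

sizes = ["M", "L", "XL", "L"]                      # hypothetical feature column
encoded = [size_mapping[s] for s in sizes]         # strings -> ordered integers
decoded = [inverse_size_mapping[e] for e in encoded]  # and back again

print(encoded)
```

A nominal feature such as color should not be encoded this way, since the integers would suggest an order that does not exist.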
Categorical data encoding with pandas
Before we explore different techniques for handling such categorical data, let's
create a new DataFrame to illustrate the problem:

The newly created DataFrame contains a nominal feature (color), an ordinal feature (size), and a numerical
feature (price) column. The class labels (assuming that we created a dataset for a supervised learning task)
are stored in the last column.
5) Encoding Categorical data:

Categorical data is data that takes values from a fixed set of categories; in our dataset,
there are two categorical variables, Country and Purchased.
Since a machine learning model works entirely on mathematics and numbers, a
categorical variable in our dataset may cause trouble while building the model.
So it is necessary to encode these categorical variables into numbers.
For Country variable:
Firstly, we will convert the country names into numeric codes.
To do this, we will use the LabelEncoder() class from the preprocessing library.
#Categorical data
#for Country Variable
from sklearn.preprocessing import LabelEncoder
label_encoder_x= LabelEncoder()
x[:, 0]= label_encoder_x.fit_transform(x[:, 0])
Output:
Out[15]:
array([[2, 38.0, 68000.0],
       [0, 43.0, 45000.0],
       [1, 30.0, 54000.0],
       [0, 48.0, 65000.0],
       [1, 40.0, 65222.22222222222],
       [2, 35.0, 58000.0],
       [1, 41.111111111111114, 53000.0],
       [0, 49.0, 79000.0],
       [2, 50.0, 88000.0],
       [0, 37.0, 77000.0]], dtype=object)
Explanation:
In the above code, we have imported the LabelEncoder class of the sklearn
library. This class has encoded the country names into digits.
But in our case, there are three countries, and as we can see in the above
output, they are encoded into 0, 1, and 2. From these values, the machine
learning model may assume that the countries have some order or magnitude,
which will produce the wrong output. So to remove this issue, we will
use dummy encoding.
Dummy Variables:
Dummy variables are variables that take only the values 0 or 1. A value of 1 marks the
category a row belongs to, and the columns for all other categories are 0.
With dummy encoding, we will have a number of columns equal to the number of
categories.
In our dataset, we have 3 categories, so it will produce three columns having 0 and 1
values. For dummy encoding, we will use the OneHotEncoder class
of the sklearn.preprocessing module.

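#for Country Variable
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
label_encoder_x= LabelEncoder()
x[:, 0]= label_encoder_x.fit_transform(x[:, 0])
#Encoding for dummy variables
#(OneHotEncoder no longer accepts categorical_features; ColumnTransformer selects column 0)
onehot_encoder= ColumnTransformer([('country', OneHotEncoder(), [0])], remainder='passthrough', sparse_threshold=0)
x= onehot_encoder.fit_transform(x).astype(float)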
Output:
array([[0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 3.80000000e+01,
6.80000000e+04],
[1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.30000000e+01,
4.50000000e+04],
[0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 3.00000000e+01,
5.40000000e+04],
[1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.80000000e+01,
6.50000000e+04],
[0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 4.00000000e+01,
6.52222222e+04],
[0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 3.50000000e+01,
5.80000000e+04],
[0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 4.11111111e+01,
5.30000000e+04],
[1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.90000000e+01,
7.90000000e+04],
[0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 5.00000000e+01,
8.80000000e+04],
[1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 3.70000000e+01,
7.70000000e+04]])
In the above output, the country variable is encoded into the numbers 0 and 1 and divided into three columns (the first three); the age and salary columns follow unchanged.
It can be seen more clearly in the variable explorer section by clicking on the x option.
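The dummy encoding step can also be tried on a small self-contained array. This is a minimal sketch, assuming a recent scikit-learn release; the four rows are hypothetical and only mimic the country, age, salary layout used above.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical rows: country, age, salary.
x = np.array(
    [
        ["France", 38.0, 68000.0],
        ["Spain", 43.0, 45000.0],
        ["Germany", 30.0, 54000.0],
        ["Spain", 48.0, 65000.0],
    ],
    dtype=object,
)

# One-hot encode column 0 only; pass age and salary through unchanged.
encoder = ColumnTransformer(
    [("country", OneHotEncoder(), [0])],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array
)
x_encoded = encoder.fit_transform(x).astype(float)

# Categories are sorted alphabetically, which fixes the dummy column order.
print(encoder.named_transformers_["country"].categories_)
print(x_encoded)
```

Each row has exactly one 1 among the first three columns, so no artificial order between countries is introduced.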
For Purchased Variable:
labelencoder_y= LabelEncoder()
y= labelencoder_y.fit_transform(y)
For the second categorical variable, we will only use a labelencoder object of the
LabelEncoder class. Here we are not using the OneHotEncoder class because the
purchased variable has only two categories, no and yes, which are automatically
encoded into 0 and 1.

Output:
Out[17]: array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])
It can also be seen as:
6) Splitting the Dataset into the Training set and Test set
In machine learning data preprocessing, we divide our dataset into a training set and
test set. This is one of the crucial steps of data preprocessing because it lets us
check how well our machine learning model performs on data it has not seen.
Suppose we train our machine learning model on one dataset and then test it on a
completely different dataset. The model may then have difficulty applying the
correlations it learned during training.
If we train our model very well and its training accuracy is very high, but its
performance drops when we provide a new dataset, the model has not generalized.
So we always try to make a machine learning model which performs well with the
training set and also with the test dataset.
We can define these datasets as:

Training Set: A subset of dataset to train the machine learning model, and we already know the output.

Test set: A subset of dataset to test the machine learning model, and by using the test set, model
predicts the output.
For splitting the dataset, use the code:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.2, random_state=0)

Explanation:
• The first line is used for splitting arrays of the dataset into random train and test subsets.
• In the second line, we have used four variables for our output:
• x_train: features for the training data
• x_test: features for the testing data
• y_train: dependent variable for the training data
• y_test: dependent variable for the testing data
• In the train_test_split() function, we have passed four parameters, of which the first two are the
arrays of data, and test_size specifies the size of the test set.
• The test_size may be .5, .3, or .2, which gives the fraction of the dataset held out for testing.
• The last parameter, random_state, sets a seed for the random generator so that you
always get the same result. The code above uses 0; another commonly used value is 42.
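The effect of test_size and random_state can be checked on a toy dataset. This is a minimal sketch; the arrays are made up for illustration and are not the dataset used in these slides.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples with 2 features each, and a binary label.
x = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0
)
print(x_train.shape, x_test.shape)

# Repeating the call with the same seed reproduces the same split.
x_train2, x_test2, y_train2, y_test2 = train_test_split(
    x, y, test_size=0.2, random_state=0
)
print(np.array_equal(x_test, x_test2))
```

With test_size=0.2, two of the ten samples are held out for testing and eight remain for training.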
Output:
By executing the above code, we will get 4 different variables, which can be seen
under the variable explorer section.

As we can see in the above image, the x and y variables are divided into 4 different
variables with corresponding values
7) Feature Scaling
Feature scaling is the final step of data
preprocessing in machine learning.
Decision trees and random forests are two of
the very few machine learning algorithms
where we don't need to worry about feature
scaling. Those algorithms are scale invariant.
Majority of machine learning and optimization
algorithms behave much better if features are
on the same scale.
Assume that we have two features where one
feature is measured on a scale from 1 to 10 and
the second feature is measured on a scale from
1 to 100,000, respectively.
Feature scaling is a technique to standardize
the independent variables of the dataset in a
specific range.
In feature scaling, we put our variables in the
same range and on the same scale so that no
variable dominates the others.
Consider the given dataset:
As we can see, the age and salary column values are not on the same scale. Many
machine learning models are based on Euclidean distance, and if we do not scale the
variables, it will cause issues in such a model.
Euclidean distance between two points $(x_1, y_1)$ and $(x_2, y_2)$ is given as:
$d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$
If we compute the distance between any two rows using age and salary, then the salary
values will dominate the age values, and it will produce an incorrect result.
So to remove this issue, we need to perform feature scaling for machine learning.
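A short calculation makes the dominance of the salary column concrete. The two rows below are hypothetical values chosen for illustration, not taken from the dataset in these slides.

```python
import numpy as np

# Two hypothetical rows: (age, salary).
a = np.array([25.0, 50000.0])
b = np.array([55.0, 51000.0])

diff = b - a
distance = np.sqrt(np.sum(diff ** 2))

# Fraction of the squared distance that comes from the salary column.
salary_share = diff[1] ** 2 / np.sum(diff ** 2)

print("distance:", distance)
print("share of squared distance due to salary:", salary_share)
```

Although the two people differ by 30 years in age and by only 2% in salary, almost all of the distance comes from salary, because its raw numbers are so much larger.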
There are two ways to perform feature scaling in machine learning:
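Standardization: $x_{std} = \frac{x - \mu_x}{\sigma_x}$, where $\mu_x$ is the mean and $\sigma_x$ the standard deviation of the feature column.
Normalization: $x_{norm} = \frac{x - x_{min}}{x_{max} - x_{min}}$, where $x_{min}$ and $x_{max}$ are the smallest and largest values in the feature column.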
Normalization refers to the rescaling of the features to a range of [0, 1], which is a special
case of min-max scaling.

The min-max scaling procedure is implemented in scikit-learn and can be used as follows:
>>> from sklearn.preprocessing import MinMaxScaler
>>> mms = MinMaxScaler()
>>> X_train_norm = mms.fit_transform(X_train)
>>> X_test_norm = mms.transform(X_test)

Although normalization via min-max scaling is a commonly used technique that is
useful when we need values in a bounded interval, standardization can be more
practical for many machine learning algorithms, especially for optimization algorithms
such as gradient descent.
Many linear models, such as the logistic regression and SVM initialize the weights
to 0 or small random values close to 0.
Using standardization, we center the feature columns at mean 0 with standard
deviation 1 so that the feature columns have the same parameters as a standard
normal distribution (zero mean and unit variance), which makes it easier to learn
the weights.
Standardization maintains useful information about outliers and makes the
algorithm less sensitive to them in contrast to min-max scaling, which scales the
data to a limited range of values.
The following table illustrates the difference between the two commonly used feature scaling
techniques, standardization and normalization, on a simple example dataset consisting of
numbers 0 to 5:

>>> import numpy as np
>>> ex = np.array([0, 1, 2, 3, 4, 5])
>>> print('standardized:', (ex - ex.mean()) / ex.std())
standardized: [-1.46385011 -0.87831007 -0.29277002 0.29277002
0.87831007 1.46385011]
>>> print('normalized:', (ex - ex.min()) / (ex.max() - ex.min()))
normalized: [ 0. 0.2 0.4 0.6 0.8 1. ]
Here, we will use the standardization method for our dataset.
For feature scaling, we will import StandardScaler class
of sklearn.preprocessing library as
from sklearn.preprocessing import StandardScaler
Now, we will create the object of StandardScaler class for independent variables or
features. And then we will fit and transform the training dataset.
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)
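For the test dataset, we will directly apply the transform() function instead of fit_transform()
because the scaler has already been fitted on the training set, so the test data is scaled with the
training mean and standard deviation.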
x_test= st_x.transform(x_test)
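The following sketch, on made-up age and salary values, checks that transform() on the test set really reuses the statistics learned from the training set. The array contents are assumptions for illustration only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical (age, salary) rows.
x_train = np.array([[30.0, 40000.0], [40.0, 60000.0], [50.0, 80000.0]])
x_test = np.array([[35.0, 50000.0]])

scaler = StandardScaler()
x_train_std = scaler.fit_transform(x_train)  # learns mean and std from x_train
x_test_std = scaler.transform(x_test)        # reuses the training mean and std

# The same result computed by hand from the training statistics.
manual = (x_test - x_train.mean(axis=0)) / x_train.std(axis=0)

print(x_test_std)
print(manual)
```

Fitting a second scaler on the test set would use different means and standard deviations, so training and test features would no longer be on a comparable scale.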
Output:
By executing the above lines of code,
we will get the scaled values for
x_train and x_test.
As we can see in the output, each feature column of x_train now has mean 0 and standard deviation 1, so most values lie within a few units of 0.
Note: Here, we have not scaled the dependent variable because it is a class label with only two values, 0
and 1. If the dependent variable were continuous with a wide range of values (a regression target), we could
scale it as well.
Other, more advanced methods for feature scaling are available from scikit-learn, such
as the RobustScaler. The RobustScaler is especially helpful and recommended if
we are working with small datasets that contain many outliers.
Similarly, if the machine learning algorithm applied to this dataset is prone to
overfitting, the RobustScaler can be a good choice.
Operating on each feature column independently, the RobustScaler removes the
median value and scales the dataset according to the 1st and 3rd quartile of the
dataset (that is, the 25th and 75th percentile, respectively) such that more extreme
values and outliers become less pronounced.
More information about the RobustScaler is available in the official scikit-learn
documentation at
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html
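To see what the median and quartile based scaling does, the sketch below compares RobustScaler with StandardScaler on a single made-up feature column that contains one extreme value.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# One feature column with a single extreme value (illustrative data).
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# RobustScaler: subtract the median, divide by the interquartile range.
robust = RobustScaler().fit_transform(x)

# StandardScaler: subtract the mean, divide by the standard deviation.
standard = StandardScaler().fit_transform(x)

print("robust:  ", robust.ravel())
print("standard:", standard.ravel())
```

With RobustScaler the four ordinary values keep an evenly spaced spread around 0, because the median and quartiles are barely affected by the extreme value. With StandardScaler the same four values are squeezed close together, since the outlier inflates both the mean and the standard deviation.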
Combining all the steps:
Now, in the end, we can combine all the steps together to make our complete code more
understandable.
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
#importing datasets
data_set= pd.read_csv('Dataset.csv')
#Extracting Independent Variable
x= data_set.iloc[:, :-1].values
#Extracting Dependent variable
y= data_set.iloc[:, 3].values
#handling missing data(Replacing missing data with the mean value)
#(SimpleImputer replaces the Imputer class removed from sklearn.preprocessing)
from sklearn.impute import SimpleImputer
imputer= SimpleImputer(missing_values= nm.nan, strategy='mean')
#Fitting imputer object to the independent variables x.
imputer= imputer.fit(x[:, 1:3])
#Replacing missing data with the calculated mean value
x[:, 1:3]= imputer.transform(x[:, 1:3])
#for Country Variable
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
label_encoder_x= LabelEncoder()
x[:, 0]= label_encoder_x.fit_transform(x[:, 0])
#Encoding for dummy variables
#(ColumnTransformer replaces the removed categorical_features argument)
onehot_encoder= ColumnTransformer([('country', OneHotEncoder(), [0])], remainder='passthrough', sparse_threshold=0)
x= onehot_encoder.fit_transform(x).astype(float)
#encoding for purchased variable
labelencoder_y= LabelEncoder()
y= labelencoder_y.fit_transform(y)
# Splitting the dataset into training and test set.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.2, random_state=0)
#Feature Scaling of datasets
from sklearn.preprocessing import StandardScaler
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)
x_test= st_x.transform(x_test)
In the above code, we have included all the data preprocessing steps together.
But there are some steps or lines of code which are not necessary for all machine
learning models.
So we can exclude them from our code to make it reusable for all models.
Selecting meaningful features
If we notice that a model performs much better on a training dataset than on the
test dataset, this observation is a strong indicator of overfitting.
Overfitting means the model fits the parameters too closely with regard to the
particular observations in the training dataset, but does not generalize well to new
data; we say that the model has a high variance.
The reason for the overfitting is that our model is too complex for the given training
data.
Common solutions to reduce the generalization error are as follows:
• Collect more training data
• Introduce a penalty for complexity via regularization
• Choose a simpler model with fewer parameters
• Reduce the dimensionality of the data
Collecting more training data is often not applicable.
L1 and L2 Regularization
Regularization techniques are essential in machine learning to prevent overfitting, improve model
generalization, and ensure that the model performs well on unseen data. Two commonly used
regularization techniques are L1 and L2 regularization.
What is Regularization?
Regularization involves adding a penalty term to the loss function used to train a machine learning
model. This penalty term discourages the model from fitting the noise in the training data and helps in
controlling the complexity of the model.

Regularization of an estimator works by trading increased bias for reduced variance.
An effective regularizer is one that makes the best trade between bias and variance: the
end product of the tradeoff should be a significant reduction in variance at minimum
expense to bias. In simpler terms, this means low variance without greatly increasing
the bias.
L1 Regularization (Lasso)
L1 regularization adds the absolute value of the coefficients to the loss function.
Formula:
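For a linear regression model with weights $w_j$, predictions $\hat{y}_i$, and regularization parameter $\lambda$, the loss function with L1 regularization is given by:
$Loss = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{m} |w_j|$
Lasso Regression (Least Absolute Shrinkage and Selection Operator) adds the “absolute value of magnitude” of
each coefficient as a penalty term to the loss function.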
Sparse Solutions: L1 regularization tends to produce sparse solutions, meaning many of the coefficients are
driven to zero. This is useful for feature selection, as it effectively selects a subset of the most important
features.
Feature Selection: Because L1 regularization can zero out coefficients, it can be used to automatically select
features that are most relevant to the predictive model.
▪ L1 regularization is easy to implement and is trained in one shot: once training is done, the learned
weight vector can be used directly, with no separate feature selection step.
▪ L1 regularization creates sparsity in the solution (most of the coefficients of the solution are zero),
which means the weights of less important features or noise terms will be zero. This makes the model
robust to irrelevant and noisy features.
Sparse solutions with L1 regularization

Since the L1 penalty is the sum of the absolute weight coefficients (remember that the L2 term is
quadratic), we can represent it as a diamond-shape budget.
The contour of the cost function touches the L1 diamond at 𝑤1 = 0.
Since the contours of an L1 regularized system are sharp, it is more likely that
the optimum (that is, the intersection between the ellipses of the cost
function and the boundary of the L1 diamond) is located on the axes, which
encourages sparsity.
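The sparsity effect can be observed numerically. The sketch below is an illustration on synthetic data, where only the first two of ten features influence the target; the data, the seed, and the penalty strength alpha=0.1 (scikit-learn's name for the regularization strength) are assumptions chosen for the demonstration.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# Only the first two features carry signal; the other eight are pure noise.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

ols = LinearRegression().fit(X, y)   # no penalty
lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty

n_zero_ols = int(np.sum(ols.coef_ == 0.0))
n_zero_lasso = int(np.sum(lasso.coef_ == 0.0))

print("zero coefficients without penalty:", n_zero_ols)
print("zero coefficients with L1 penalty:", n_zero_lasso)
print("lasso coefficients:", np.round(lasso.coef_, 3))
```

The unpenalized model assigns small non-zero weights to the noise features, while the L1 penalty sets them exactly to zero and keeps the two informative features, slightly shrunk.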
L2 Regularization (Ridge)
L2 regularization adds the squared value of the coefficients to the loss function.
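For a linear regression model with weights $w_j$, predictions $\hat{y}_i$, and regularization parameter $\lambda$, the loss function with L2 regularization is given by:
$Loss = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{m} w_j^2$
Ridge regression adds the “squared magnitude of the coefficient” as a penalty term to the loss function. Here
the second term, $\lambda \sum_{j=1}^{m} w_j^2$, is the L2 regularization term.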

Small Coefficients: L2 regularization tends to shrink the coefficients, but not necessarily to zero. This means
that while it reduces the impact of less important features, it usually retains all features in the model.
Smooth Solutions: L2 regularization tends to produce smoother models, which can be advantageous in
preventing overfitting when there are many correlated features.

Ridge regression performs better when all the input features influence the output and all the weights are
of roughly equal size.
Because it keeps every feature in the model, L2 regularization can still learn complex data patterns.
A geometric interpretation of L2 regularization

L2 regularization adds a penalty term to the cost function that effectively results in less
extreme weight values compared to a model trained with an unregularized cost function.
The quadratic L2 regularization term is represented by the shaded ball. The larger the
value of the regularization parameter, 𝜆, gets, the faster the penalized cost grows, which leads
to a narrower L2 ball. For example, if we increase the regularization parameter towards
infinity, the weight coefficients will become effectively zero, denoted by the center of the L2
ball.
Our goal is to minimize the sum of the unpenalized cost plus the penalty term, which can be understood as
adding bias and preferring a simpler model to reduce the variance in the absence of sufficient training data to
fit the model.
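The shrinking effect of a growing regularization parameter can be sketched with scikit-learn's Ridge estimator, whose alpha argument plays the role of 𝜆. The synthetic data and the list of alpha values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
true_w = np.array([4.0, -3.0, 2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.5, size=100)

alphas = [0.01, 1.0, 100.0, 10000.0]
coefs = [Ridge(alpha=a).fit(X, y).coef_ for a in alphas]

# Length of the weight vector for each regularization strength.
norms = [float(np.linalg.norm(w)) for w in coefs]

for a, n in zip(alphas, norms):
    print(f"alpha={a:>8}: ||w|| = {n:.4f}")
```

The weight vector gets shorter as alpha grows and approaches the origin for very large alpha, but the individual coefficients are shrunk rather than set exactly to zero, in contrast to the L1 penalty.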
Sequential feature selection algorithms
An alternative way to reduce the complexity of the model and avoid overfitting is
dimensionality reduction via feature selection, which is especially useful for
unregularized models.
There are two main categories of dimensionality reduction techniques: feature
selection and feature extraction.
Via feature selection, we select a subset of the original features, whereas in feature
extraction, we derive information from the feature set to construct a new feature
subspace.
Sequential feature selection algorithms are a family of greedy search algorithms that
are used to reduce an initial d-dimensional feature space to a k-dimensional feature
subspace where k<d.
The motivation behind feature selection algorithms is to automatically select a subset
of features that are most relevant to the problem, to improve computational
efficiency, or to reduce the generalization error of the model by removing irrelevant
features or noise, which can be useful for algorithms that don't support
regularization.
A classic sequential feature selection algorithm is sequential backward selection
(SBS), which aims to reduce the dimensionality of the initial feature subspace with a
minimum decay in the performance of the classifier to improve upon computational
efficiency.
In certain cases, SBS can even improve the predictive power of the model if a model
suffers from overfitting.
SBS sequentially removes features from the full feature subset until the new feature
subspace contains the desired number of features.
In order to determine which feature is to be removed at each stage, we need to
define the criterion function, J, that we want to maximize.
The criterion calculated by the criterion function can simply be the difference in
performance of the classifier after and before the removal of a particular feature.
Then, the feature to be removed at each stage can simply be defined as the feature
that maximizes this criterion; or in more simple terms, at each stage we eliminate
the feature that causes the least performance loss after removal.
Sequential Backward Selection (SBS) Algorithm

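Older scikit-learn releases did not implement the SBS algorithm; since version 0.24, scikit-learn provides SequentialFeatureSelector, which supports backward selection via direction='backward'.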
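Backward selection as described above can be sketched with SequentialFeatureSelector from scikit-learn (available from version 0.24). The Iris dataset, the k-nearest neighbors classifier, and the target of two features are assumptions chosen for illustration; the selector scores each candidate removal by cross-validation rather than on a single hold-out set.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)   # d = 4 features
knn = KNeighborsClassifier(n_neighbors=5)

# Start from all 4 features and greedily remove one at a time until k = 2 remain.
selector = SequentialFeatureSelector(
    knn, n_features_to_select=2, direction="backward", cv=5
)
selector.fit(X, y)

selected = selector.get_support()   # boolean mask over the original features
X_reduced = selector.transform(X)   # data restricted to the kept features

print("kept features:", selected)
print("reduced shape:", X_reduced.shape)
```

At each stage the selector drops the feature whose removal leaves the best cross-validated score, which matches the rule of eliminating the feature that causes the least performance loss.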


Greedy search algorithms
Greedy algorithms make locally optimal choices at each stage of a combinatorial
search problem and generally yield a suboptimal solution to the problem, in contrast
to exhaustive search algorithms, which evaluate all possible combinations and are
guaranteed to find the optimal solution.
An exhaustive search is often computationally not feasible, whereas greedy
algorithms allow for a less complex, computationally more efficient solution.
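The difference in cost can be made concrete by counting how many feature subsets each strategy has to score. This is a small sketch with hypothetical helper names; d = 30 and k = 10 are arbitrary example values.

```python
from math import comb


def exhaustive_evaluations(d: int, k: int) -> int:
    """Number of k-feature subsets an exhaustive search must score."""
    return comb(d, k)


def backward_greedy_evaluations(d: int, k: int) -> int:
    """Subsets scored by sequential backward selection from d down to k features."""
    # At a stage with m features, each of the m candidate removals is scored once.
    return sum(range(k + 1, d + 1))


d, k = 30, 10
print("exhaustive:", exhaustive_evaluations(d, k))
print("greedy backward:", backward_greedy_evaluations(d, k))
```

For 30 features reduced to 10, the exhaustive search scores about 30 million subsets, while the greedy backward search scores a few hundred, at the price of possibly missing the best subset.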
Questions??
