Course Name: Machine Learning & Blockchain
Course Code: IoTCSBCC701
Faculty Incharge: Dr. Sheeba P. S.
Module 1
Introduction to Machine Learning
Introduction:- What Is Learning? When Do We Need Machine Learning?
Types of Learning, Relations to Other Fields
Basic Terminology & Framework:- Machine Learning Terminology
Roadmap for building machine learning -- Preprocessing, Training, and Model selection,
Evaluating and Predicting
Python for machine learning -- Packages for scientific computing, data science, and machine
learning
Data Preprocessing:-
Dealing with missing data, Handling Categorical data, Partitioning a dataset into separate
training and test datasets, Bringing features onto the same scale, Select meaningful features
What is Machine Learning?
In the real world, we are surrounded by humans who can learn from their experiences, and we
have computers or machines that work on our instructions. But can a machine also learn from
experience or past data, as a human does? This is where Machine Learning comes in.
Machine Learning is a subset of artificial intelligence that is mainly concerned with the
development of algorithms that allow a computer to learn from data and past experience on
its own. The term machine learning was first introduced by Arthur Samuel in 1959.
What is Machine Learning?
“Learning is any process by which a system improves performance from
experience.”-Herbert Simon
Definition by Tom Mitchell(1998):
Machine Learning is the study of algorithms that
• improve their performance P
• at some task T
• with experience E.
A well-defined learning task is given by <P,T,E>.
Magic?
No, it's more like gardening
• Seeds = Algorithms
• Nutrients = Data
• Gardener = You
• Plants = Programs
In Machine Learning, algorithms are like the seeds that the gardener plants. The algorithms are
trained on data and monitored to see how they perform. Just like a gardener adjusts the conditions
for each plant, the machine learning algorithm can be adjusted to improve its performance. This can
involve tweaking parameters, adding or removing features, or trying different algorithms altogether.
How does Machine Learning work
A Machine Learning system learns from historical data, builds a prediction model, and
whenever it receives new data, predicts the output for it.
The accuracy of the predicted output depends in part on the amount of data: more data
generally helps to build a better model, which predicts the output more accurately.
Suppose we have a complex problem for which we need to make predictions. Instead of writing
task-specific code, we feed the data to generic algorithms; with the help of these
algorithms, the machine builds the logic from the data and predicts the output.
How does Machine Learning work…
The block diagram explains the working of Machine Learning algorithm:
Features of Machine Learning:
Adaptivity. One limiting feature of programmed tools is their rigidity - once the
program has been written down and installed, it stays unchanged.
However, many tasks change over time or from one user to another.
Machine learning tools - programs whose behavior adapts to their input data - offer
a solution to such issues; they are, by nature, adaptive to changes in the
environment they interact with.
Typical successful applications of machine learning to such problems include
programs that decode handwritten text, where a fixed program can adapt to
variations between the handwriting of different users; spam detection programs,
adapting automatically to changes in the nature of spam e-mails; and speech
recognition programs.
When Do We Need Machine Learning?...
Machine learning is particularly useful in the following scenarios:
1. Large Amounts of Data: When there is a vast amount of data to analyze, machine learning can process
and draw insights from it more efficiently than manual methods.
2. Complex Patterns and Relationships: When patterns or relationships in the data are too complex for
traditional statistical methods to uncover, machine learning algorithms can be employed to detect these
hidden patterns.
3. Automation of Repetitive Tasks: For tasks that are repetitive and rule-based, machine learning can
automate processes, reducing the need for human intervention and minimizing errors.
4. Dynamic Environments: In environments where conditions change rapidly and systems need to adapt
in real-time (e.g., financial markets, autonomous vehicles), machine learning models can adjust and
learn from new data continuously.
5. Personalization and Recommendations: When there's a need to provide personalized experiences or
recommendations based on user behavior and preferences, such as in e-commerce, streaming services,
or online advertising.
6. Predictive Analytics: For making predictions based on historical data, such as forecasting sales,
predicting equipment failures, or estimating customer churn.
When Do We Need Machine Learning?...
• Image and Speech Recognition: When dealing with tasks that require recognizing patterns in
images, audio, or video, such as facial recognition, speech-to-text conversion, and object detection.
• Natural Language Processing (NLP): For understanding and generating human language, machine
learning is used in applications like chatbots, sentiment analysis, language translation, and text
summarization.
• Anomaly Detection: When identifying unusual patterns that might indicate fraud, security breaches,
or other significant deviations from the norm, such as in cybersecurity, financial monitoring, and quality
control.
• Optimizing Operations: In logistics, supply chain management, and resource allocation, machine
learning can help optimize operations by predicting demand, identifying inefficiencies, and
recommending improvements.
• Enhanced Customer Support: For automating and improving customer service through chatbots
and virtual assistants that can handle a large volume of inquiries and provide quick, accurate
responses.
• Medical Diagnosis and Treatment: In healthcare, machine learning can assist in diagnosing
diseases, predicting patient outcomes, and personalizing treatment plans based on patient data.
Every machine learning algorithm has three components:
1. Representation: what the model looks like; how knowledge is represented.
2. Evaluation: how good models are differentiated; how programs are evaluated.
Evaluation is done using an evaluation function.
3. Optimization: the process for finding good models; how programs are generated.
Representation
• Decision trees
• Sets of rules / Logic programs
• Instances
• Graphical models (Bayes/Markov nets)
• Neural networks
• Support vector machines
• Model ensembles
Etc.
Evaluation
• Accuracy
• Precision and recall
• Squared error
• Likelihood
• Posterior probability
• Cost / Utility
• Margin
• Entropy
• K-L divergence
Etc.
Optimization
• Combinatorial optimization
E.g.: Greedy search
• Convex optimization
E.g.: Gradient descent
• Constrained optimization
E.g.: Linear programming
Types of Machine Learning
➢ Supervised learning
➢ Unsupervised learning
➢ Reinforcement learning
Supervised Learning
Supervised learning is a type of machine learning in which we provide sample
labelled data to the machine learning system in order to train it, and on that basis, it
predicts the output.
The system creates a model using labelled data to understand the dataset and learn about
each type of data. Once training and processing are done, we test the model by providing
sample data to check whether it predicts the correct output.
The goal of supervised learning is to map input data to output data. Supervised
learning is based on supervision, much as a student learns under the
supervision of a teacher.
An example of supervised learning is spam filtering.
Supervised learning can be grouped further in two categories of algorithms:
• Classification
• Regression
Unsupervised Learning
Unsupervised learning is a learning method in which a machine learns without any
supervision.
The machine is trained on a set of data that has not been labelled,
classified, or categorized, and the algorithm needs to act on that data without any
supervision.
The goal of unsupervised learning is to restructure the input data into new features or
groups of objects with similar patterns.
In unsupervised learning, we don't have a predetermined result. The machine tries to find
useful insights from a large amount of data.
It can be further classified into two categories of algorithms:
• Clustering
• Association
Reinforcement Learning
Reinforcement learning is a feedback-based learning method in which a learning
agent gets a reward for each right action and a penalty for each wrong action.
The agent learns automatically from this feedback and improves its performance.
In reinforcement learning, the agent interacts with the environment and explores it.
The goal of the agent is to collect the most reward points, and hence it improves its
performance.
A robotic dog that automatically learns how to move its limbs is an example
of reinforcement learning.
The three different types of machine learning
Supervised Learning
The main goal in supervised learning is to learn a model from labelled training data
that allows us to make predictions about unseen or future data.
Here, the term "supervised" refers to a set of training examples (data inputs) where
the desired output signals (labels) are already known.
The figure summarizes a typical supervised learning workflow, where the labelled
training data is passed to a machine learning algorithm for fitting a predictive model
that can make predictions on new, unlabelled data inputs:
In supervised learning, models are trained using a labelled dataset, where the model
learns about each type of data. Once the training process is completed, the model is
tested on test data (a held-out portion of the labelled dataset that was not used for
training), and then it predicts the output.
Suppose we have a dataset of different types of shapes,
which includes squares, rectangles, triangles, and hexagons.
The first step is to train the model for
each shape.
• If the given shape has four sides, and all the sides are
equal, then it will be labelled as a square.
• If the given shape has three sides, then it will be labelled
as a triangle.
• If the given shape has six equal sides, then it will be
labelled as a hexagon.
After training, we test our model using the test set,
and the task of the model is to identify the shape. The
machine has already been trained on all types of shapes, and when
it finds a new shape, it classifies the shape on the basis of
the number of sides, and predicts the output.
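The shape example above can be sketched with scikit-learn. This is only an illustration: the two-number encoding of a shape (number of sides, and whether all sides are equal) and the choice of a decision tree are assumptions made here, not part of the original example.

```python
from sklearn.tree import DecisionTreeClassifier

# Assumed encoding: [number_of_sides, all_sides_equal (1 = yes, 0 = no)]
X_train = [[4, 1], [4, 0], [3, 0], [3, 1], [6, 1]]
y_train = ["square", "rectangle", "triangle", "triangle", "hexagon"]

# Train (fit) the model on the labelled examples
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# Ask the trained model to label some shapes
X_test = [[3, 1], [4, 1], [6, 1]]
predictions = model.predict(X_test)
print(list(predictions))
```

To keep the outcome predictable, the shapes queried here repeat encodings seen during training; a real test set would consist of held-out examples.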
1. Regression
2. Classification
Once a suitable algorithm is applied to unlabelled data, it divides the data objects into groups according to the
similarities and differences between the objects.
Clustering: Clustering is a method of grouping objects
into clusters such that objects with the most similarities
remain in one group and have few or no similarities with
the objects of another group. Cluster analysis finds the
commonalities between the data objects and categorizes
them according to the presence or absence of those
commonalities.
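As a minimal sketch of clustering, the snippet below groups six made-up 2D points with scikit-learn's KMeans. The data and the choice of k-means with two clusters are assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two visibly separated groups of unlabelled points (made-up data)
points = np.array([[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
                   [8.0, 8.2], [8.1, 7.9], [7.9, 8.0]])

# Ask for two clusters; no labels are provided
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(points)
print(labels)  # one cluster id per point
```

The numeric cluster ids are arbitrary; what matters is which points share an id.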
Sometimes there can be many features (input values) with different weights:
$y = b + w_1 x_1 + w_2 x_2 + w_3 x_3 + w_4 x_4$
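The weighted sum above can be computed with a dot product. The weights, bias, and inputs below are arbitrary values chosen for illustration, and the function name predict is hypothetical.

```python
import numpy as np

def predict(x, w, b):
    """Return b + w1*x1 + ... + wn*xn for a single example."""
    return b + np.dot(w, x)

w = np.array([0.5, -1.0, 2.0, 0.0])  # one weight per feature (made-up)
b = 1.0                              # bias term (made-up)
x = np.array([2.0, 1.0, 3.0, 4.0])   # feature values of one example

y = predict(x, w, b)
print(y)
```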
Notation and Conventions
The table depicts an excerpt of the
Iris dataset, which is a classic example
in the field of machine learning.
The Iris dataset contains the
measurements of 150 Iris flowers
from three different species—Setosa,
Versicolor, and Virginica.
Here, each flower example represents
one row in our dataset, and the
flower measurements in centimeters
are stored as columns, which we also
call the features of the dataset:
To keep the notation and implementation simple yet efficient, we will make use of some
of the basics of linear algebra.
We will follow the common convention to represent each example as a separate row in
a feature matrix, X, where each feature is stored as a separate column.
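The Iris dataset, consisting of 150 examples and four features, can then be written as
a 150 × 4 matrix, $\mathbf{X} \in \mathbb{R}^{150 \times 4}$:
$$
\mathbf{X} = \begin{bmatrix}
x_1^{(1)} & x_2^{(1)} & x_3^{(1)} & x_4^{(1)} \\
x_1^{(2)} & x_2^{(2)} & x_3^{(2)} & x_4^{(2)} \\
\vdots & \vdots & \vdots & \vdots \\
x_1^{(150)} & x_2^{(150)} & x_3^{(150)} & x_4^{(150)}
\end{bmatrix}
$$
We will use the superscript i to refer to the ith training example, and the subscript j to
refer to the jth dimension of the training dataset.
We will use lowercase, bold-face letters to refer to vectors ($\mathbf{x} \in \mathbb{R}^{n \times 1}$) and
uppercase, bold-face letters to refer to matrices ($\mathbf{X} \in \mathbb{R}^{n \times m}$).
To refer to single elements in a vector or matrix, we will write the letters in italics ($x^{(n)}$ or
$x_m^{(n)}$, respectively).
For example, $x_1^{(150)}$ refers to the first dimension of flower example 150, the sepal
length.
Thus, each row in this feature matrix represents one flower instance and can be
written as a four-dimensional row vector, $\mathbf{x}^{(i)} \in \mathbb{R}^{1 \times 4}$:
$$
\mathbf{x}^{(i)} = \begin{bmatrix} x_1^{(i)} & x_2^{(i)} & x_3^{(i)} & x_4^{(i)} \end{bmatrix}
$$
And each feature dimension is a 150-dimensional column vector, $\mathbf{x}_j \in \mathbb{R}^{150 \times 1}$. For
example:
$$
\mathbf{x}_j = \begin{bmatrix} x_j^{(1)} \\ x_j^{(2)} \\ \vdots \\ x_j^{(150)} \end{bmatrix}
$$
Similarly, we will store the target variables (here, class labels) as a 150-dimensional
column vector:
$$
\mathbf{y} = \begin{bmatrix} y^{(1)} \\ \vdots \\ y^{(150)} \end{bmatrix}, \quad y^{(i)} \in \{\text{Setosa}, \text{Versicolor}, \text{Virginica}\}
$$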
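The notation maps directly onto NumPy arrays. The sketch below assumes the copy of the Iris dataset bundled with scikit-learn (load_iris); note that Python indexing starts at 0, so flower example 150 is row 149.

```python
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data    # feature matrix: one row per flower, one column per feature
y = iris.target  # class labels, encoded as the integers 0, 1, 2

print(X.shape)   # number of examples, number of features
print(y.shape)

x_150 = X[149]          # row vector for flower example 150
sepal_length = X[:, 0]  # column vector for the first feature
print(x_150[0])         # first dimension of example 150 (sepal length)
```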
Machine Learning Terminology
Machine learning is a vast field and also very interdisciplinary as it brings together
many scientists from other areas of research. As it happens, many terms and
concepts have been rediscovered or redefined and may already be familiar to you
but appear under different names.
Training example: A row in a table representing the dataset and synonymous with
an observation, record, instance, or sample (in most contexts, sample refers to a
collection of training examples).
Training: Model fitting, for parametric models similar to parameter estimation.
Feature, abbrev. x: A column in a data table or data (design) matrix.
Synonymous with predictor, variable, input, attribute, or covariate.
Target, abbrev. y: Synonymous with outcome, output, response variable, dependent
variable, (class) label, and ground truth.
Loss function: Often used synonymously with a cost function. Sometimes the loss
function is also called an error function.
In some literature, the term "loss" refers to the loss measured for a single data point,
and the cost is a measurement that computes the loss (average or summed) over the
entire dataset.
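To illustrate the distinction between loss and cost, the sketch below uses the squared error as the per-example loss and its average as the cost. The choice of squared error and the numbers are assumptions made for this example.

```python
def squared_error(y_true, y_pred):
    """Loss for a single data point."""
    return (y_true - y_pred) ** 2

def mean_squared_error_cost(targets, predictions):
    """Cost: the average of the per-point losses over the whole dataset."""
    losses = [squared_error(t, p) for t, p in zip(targets, predictions)]
    return sum(losses) / len(losses)

targets = [3.0, 5.0, 2.0]      # made-up true values
predictions = [2.0, 5.0, 4.0]  # made-up model outputs
cost = mean_squared_error_cost(targets, predictions)
print(cost)
```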
A roadmap for building machine learning systems
The diagram shows a typical workflow for using machine learning in predictive modeling:
Preprocessing
Raw data rarely comes in the form and shape that is necessary for the optimal
performance of a learning algorithm. Thus, the preprocessing of the data is one of the
most crucial steps in any machine learning application.
If we take the Iris flower dataset as an example, we can think of the raw data as a
series of flower images from which we want to extract meaningful features.
Useful features could be the color, hue, and intensity of the flowers, or the height,
length, and width of the flowers.
Many machine learning algorithms also require that the selected features are on the
same scale for optimal performance, which is often achieved by transforming the
features in the range [0, 1] or a standard normal distribution with zero mean and unit
variance.
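Both scaling approaches mentioned above can be written in a few lines of NumPy. The small matrix is made-up data; each column is scaled independently.

```python
import numpy as np

# Two features on very different scales (made-up data)
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

# Min-max scaling: map each column to the range [0, 1]
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Standardization: give each column zero mean and unit variance
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_minmax)
print(X_std)
```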
Preprocessing…
Some of the selected features may be highly correlated and therefore redundant to a
certain degree.
In those cases, dimensionality reduction techniques are useful for compressing the
features onto a lower dimensional subspace.
Reducing the dimensionality of our feature space has the advantage that less storage
space is required, and the learning algorithm can run much faster.
In certain cases, dimensionality reduction can also improve the predictive
performance of a model if the dataset contains a large number of irrelevant features
(or noise); that is, if the dataset has a low signal-to-noise ratio.
To determine whether our machine learning algorithm not only performs well on
the training dataset but also generalizes well to new data, we also want to
randomly divide the dataset into a separate training and test dataset.
We use the training dataset to train and optimize our machine learning model,
while we keep the test dataset until the very end to evaluate the final model.
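A random split into training and test datasets can be sketched with scikit-learn's train_test_split. The 70/30 ratio, the random seed, and the use of the bundled Iris data are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out 30% of the examples for testing; stratify keeps class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

print(X_train.shape, X_test.shape)
```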
Training and selecting a predictive model
Each classification algorithm has its inherent biases, and no single classification model
enjoys superiority if we don't make any assumptions about the task.
It is essential to compare at least a handful of different algorithms in order to train
and select the best performing model.
But before we can compare different models, we first have to decide upon a metric to
measure performance.
One commonly used metric is classification accuracy, which is defined as the
proportion of correctly classified instances.
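Classification accuracy as defined above is simple to compute by hand. The labels below are made up for illustration.

```python
def accuracy(y_true, y_pred):
    """Proportion of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

y_true = ["Yes", "No", "No", "Yes", "No"]   # made-up true labels
y_pred = ["Yes", "No", "Yes", "Yes", "No"]  # made-up predictions
acc = accuracy(y_true, y_pred)
print(acc)
```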
How do we know which model performs well on the final test dataset and real-world
data if we don't use this test dataset for the model selection, but keep it for the final
model evaluation?
In order to address the issue embedded in this question, different techniques
summarized as "cross-validation" can be used.
In cross-validation, we further divide a dataset into training and validation subsets in
order to estimate the generalization performance of the model.
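One common variant, k-fold cross-validation, is available in scikit-learn as cross_val_score. The sketch below assumes a k-nearest-neighbors classifier and five folds purely for illustration; in a full workflow, cross-validation would be applied to the training portion only.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=5)

# Train and validate on 5 different training/validation partitions
scores = cross_val_score(model, X, y, cv=5)
print(scores)         # one accuracy value per fold
print(scores.mean())  # averaged estimate of generalization performance
```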
Finally, we also cannot expect that the default parameters of the different learning
algorithms provided by software libraries are optimal for our specific problem task.
Therefore, we will make frequent use of hyperparameter optimization techniques that
help us to fine-tune the performance of our model.
We can think of those hyperparameters as parameters that are not learned from the
data but represent the knobs of a model that we can turn to improve its performance.
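A simple form of hyperparameter optimization is a grid search, sketched here with scikit-learn's GridSearchCV. The classifier and the candidate values for n_neighbors are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# n_neighbors is a hyperparameter: it is set by us, not learned from the data
param_grid = {"n_neighbors": [1, 3, 5, 7]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)  # the best-scoring setting under cross-validation
```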
Evaluating models and predicting unseen data instances
After we have selected a model that has been fitted on the training dataset, we can
use the test dataset to estimate how well it performs on this unseen data to estimate
the so-called generalization error.
If we are satisfied with its performance, we can now use this model to predict new,
future data.
It is important to note that the parameters for feature scaling, dimensionality
reduction etc, are solely obtained from the training dataset, and the same parameters
are later reapplied to transform the test dataset, as well as any new data instances—
the performance measured on the test data may be overly optimistic otherwise.
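The point about reusing training-set parameters can be sketched with scikit-learn's StandardScaler: the scaler is fitted on the training data only and then applied unchanged to the test data. The dataset and split settings are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

scaler = StandardScaler()
scaler.fit(X_train)                      # mean and std come from training data only
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)    # same parameters reapplied to test data
```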
Using Python for Machine Learning
Python is one of the most popular programming languages for data science and a
large number of useful libraries for scientific computing and machine learning have
been developed.
Although the performance of interpreted languages, such as Python, for
computation-intensive tasks is inferior to lower-level programming languages,
extension libraries such as NumPy and SciPy have been developed that build upon
lower-layer Fortran and C implementations for fast vectorized operations on
multidimensional arrays.
For machine learning programming tasks, we will mostly refer to the scikit-learn
library, which is currently one of the most popular and accessible open source
machine learning libraries.
For the subfield of machine learning called deep learning, we will use the latest version of
the TensorFlow library, which specializes in training so-called deep neural network
models very efficiently by utilizing graphics cards.
Installing Python and packages from the Python
Package Index
Python is available for all three major operating systems (Microsoft Windows, macOS, and
Linux), and the installer, as well as the documentation, can be downloaded from the official
Python website: https://www.python.org.
We strongly advise that you use Python 3.7 or newer.
The additional packages can be installed via the pip installer program, which has been
included by default with the Python installers since Python 3.4.
More information about pip can be found at https://docs.python.org/3/installing/index.html.
After we have successfully installed Python, we can execute pip from the terminal to install
additional Python packages:
pip install SomePackage
Already installed packages can be updated via the --upgrade flag:
pip install SomePackage --upgrade
Using the Anaconda Python distribution and
package manager
A highly recommended alternative Python distribution for scientific computing is Anaconda
by Continuum Analytics.
Anaconda is a free—including commercial use—enterprise-ready Python distribution that
bundles all the essential Python packages for data science, math, and engineering into one
user-friendly, cross-platform distribution.
The Anaconda installer can be downloaded at https://docs.anaconda.com/anaconda/install/,
and an Anaconda quick start guide is available at
https://docs.anaconda.com/anaconda/user-guide/getting-started/.
After successfully installing Anaconda, we can install new Python packages using
the following command:
conda install SomePackage
Existing packages can be updated using the following command:
conda update SomePackage
Packages for scientific computing, data science, and
machine learning
We will mainly use NumPy's multidimensional arrays to store and manipulate data.
Occasionally, we will make use of pandas, which is a library built on top of NumPy
that provides additional higher-level data manipulation tools that make working
with tabular data even more convenient. To augment your learning experience and
visualize quantitative data, we will use the very customizable Matplotlib library.
Please make sure that the version numbers of your installed packages are equal to,
or greater than, the version numbers given below to ensure that the code examples
run correctly:
NumPy 1.17.4
SciPy 1.3.1
scikit-learn 0.22.0
Matplotlib 3.1.0
pandas 0.25.3
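One way to check the installed versions against the list above is the standard-library importlib.metadata module (available from Python 3.8). The helper name installed_version is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None

for name in ["numpy", "scipy", "scikit-learn", "matplotlib", "pandas"]:
    print(name, installed_version(name) or "not installed")
```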
Data Preprocessing
Data preprocessing is the process of preparing raw data and making it suitable for a
machine learning model. It is the first and a crucial step when creating a machine
learning model.
The quality of the data and the amount of useful information that it contains are key
factors that determine how well a machine learning algorithm can learn.
Real-world data generally contains noise and missing values, and may be in an unusable
format that cannot be used directly by machine learning models.
Data preprocessing is the set of tasks required to clean the data and make it suitable for a
machine learning model, which also increases the accuracy and efficiency of the
model.
It is absolutely critical to ensure that we examine and preprocess a dataset before we
feed it to a learning algorithm.
Data Preprocessing…
It involves the following steps:
➢ Getting the dataset
➢ Importing libraries
➢ Importing datasets
➢ Finding Missing Data
➢ Encoding Categorical Data
➢ Splitting dataset into training and test set
➢ Feature scaling
1) Get the Dataset
To create a machine learning model, the first thing we require is a dataset, as a
machine learning model works entirely on data.
Data collected for a particular problem and stored in a proper format is known as a
dataset.
Datasets come in different formats for different purposes. For example, the dataset for a
business-oriented machine learning model will differ from the dataset required for a model
about liver patients. So each dataset is different from another
dataset.
To use a dataset in our code, we usually put it into a CSV file. However, sometimes
we may also need to use an HTML or xlsx file.
1) Get the Dataset…
What is a CSV File?
CSV stands for "Comma-Separated Values"; it is a file format that allows us
to save tabular data, such as spreadsheets. It is useful for large datasets, and
programs can read these datasets directly.
Here we will use a demo dataset for data preprocessing; for practice, it can be
downloaded from https://www.superdatascience.com/pages/machine-learning.
For real-world problems, we can download datasets online from various sources
such as https://www.kaggle.com/uciml/datasets and
https://archive.ics.uci.edu/ml/index.php.
We can also create our own dataset by gathering data through various APIs with Python
and putting that data into a .csv file.
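Writing gathered records to CSV can be sketched with Python's built-in csv module. The two records are illustrative, and the example writes to an in-memory buffer so that it runs anywhere; replacing the buffer with open("Dataset.csv", "w", newline="") would write a file instead.

```python
import csv
import io

# Records as they might come back from a data-gathering step (made-up values)
rows = [
    {"Country": "India", "Age": 38, "Salary": 68000, "Purchased": "No"},
    {"Country": "France", "Age": 43, "Salary": 45000, "Purchased": "Yes"},
]

buffer = io.StringIO()  # stands in for a file opened for writing
writer = csv.DictWriter(
    buffer, fieldnames=["Country", "Age", "Salary", "Purchased"])
writer.writeheader()    # first line: the column names
writer.writerows(rows)  # one comma-separated line per record

csv_text = buffer.getvalue()
print(csv_text)
```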
2) Importing Libraries
In order to perform data preprocessing using Python, we need to import some
predefined Python libraries. Each of these libraries performs specific jobs.
There are three specific libraries that we will use for data preprocessing:
NumPy: The NumPy library is used for performing mathematical
operations in the code.
It is the fundamental package for scientific computing in Python. It also supports
large, multidimensional arrays and matrices.
In Python, we can import it as:
import numpy as np
Here we have used np, the conventional short name for NumPy, and it will be used throughout
the program.
2) Importing Libraries…
Matplotlib: The second library is Matplotlib, which is a Python 2D plotting library;
from this library, we need to import the sub-library pyplot.
This library is used to plot many types of charts in Python.
It is imported as below:
import matplotlib.pyplot as plt
Here we have used plt, the conventional short name for this library.
2) Importing Libraries…
Pandas: The last library is pandas, which is one of the most popular Python
libraries and is used for importing and managing datasets. It is an open-source data
manipulation and analysis library. It is imported as below:
import pandas as pd
Here, we have used pd as a short name for this library.
Example:
import pandas as pd

# Build a small DataFrame from a dictionary of columns
mydataset = {
    'cars': ["BMW", "Volvo", "Ford"],
    'passings': [3, 7, 2]
}
myvar = pd.DataFrame(mydataset)
print(myvar)
3) Importing the Datasets
Now we need to import the datasets which we have collected for our machine
learning project.
But before importing a dataset, we need to set the directory that contains it as the working
directory.
To set a working directory in Spyder IDE, we need to follow the below steps:
1. Save your Python file in the directory which contains the dataset.
2. Go to the File explorer option in Spyder IDE, and select the required directory.
3. Press the F5 key or use the Run option to execute the file.
Here, in the below image, we can see the Python file along with required dataset.
Now, the current folder is set as a working directory.
3) Importing the Datasets…
Pandas is a Python library used for working with data sets.
It has functions for analyzing, cleaning, exploring, and manipulating data.
Pandas can clean messy data sets, and make them readable and relevant.
read_csv() function:
To import the dataset, we will use the read_csv() function of the pandas library, which reads a
csv file into a DataFrame on which we can then perform various operations.
Using this function, we can read a csv file locally as well as through a URL.
We can use the read_csv function as below:
data_set = pd.read_csv('Dataset.csv')
Here, data_set is the name of the variable that stores our dataset, and inside the function, we have
passed the name of our dataset file.
Once we execute the above line of code, the dataset is imported into our code.
We can also inspect the imported dataset by opening the Variable Explorer pane in Spyder
and double-clicking data_set. Consider the below image:
As in the above image, indexing starts from 0, which is the default indexing in Python. We can also
change the display format of our dataset by clicking on the format option.
Extracting dependent and independent variables:
In machine learning, it is important to distinguish the matrix of features
(independent variables) from the dependent variable in the dataset. In our dataset, there
are three independent variables (Country, Age, and Salary) and one
dependent variable, Purchased.
Extracting independent variable:
To extract the independent variables, we will use the iloc[] indexer of the pandas library. It is
used to extract the required rows and columns from the dataset by position.
x = data_set.iloc[:, :-1].values
In the above code, the first colon (:) takes all the rows, and the second
slice (:-1) takes all the columns except the last one, because the last column
contains the dependent variable. By doing this, we get
the matrix of features.
By executing the above code, we will get output as:
[['India' 38.0 68000.0]
 ['France' 43.0 45000.0]
 ['Germany' 30.0 54000.0]
 ['France' 48.0 65000.0]
 ['Germany' 40.0 nan]
 ['India' 35.0 58000.0]
 ['Germany' nan 53000.0]
 ['France' 49.0 79000.0]
 ['India' 50.0 88000.0]
 ['France' 37.0 77000.0]]
As we can see in the above output, there are only three variables.
Extracting dependent variable:
To extract the dependent variable, we will again use the pandas iloc[] indexer.
y = data_set.iloc[:, 3].values
Here we have taken all the rows with the last column only. It gives the array of the
dependent variable.
By executing the above code, we will get output as:
array(['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'],
dtype=object)
Note: In Python, scikit-learn estimators take the feature matrix and the target as separate
arguments, so this extraction step is needed; in R, it is usually not required.
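The import and extraction steps can be tried without the demo file. The sketch below rebuilds the ten rows from the outputs shown above as CSV text and reads them from an in-memory buffer; the exact layout of the real Dataset.csv is an assumption.

```python
import io
import pandas as pd

# CSV text reconstructed from the outputs shown above (assumed file layout)
csv_text = """Country,Age,Salary,Purchased
India,38,68000,No
France,43,45000,Yes
Germany,30,54000,No
France,48,65000,No
Germany,40,,Yes
India,35,58000,Yes
Germany,,53000,No
France,49,79000,Yes
India,50,88000,No
France,37,77000,Yes
"""

data_set = pd.read_csv(io.StringIO(csv_text))  # empty fields become NaN

x = data_set.iloc[:, :-1].values  # all rows, every column except the last
y = data_set.iloc[:, -1].values   # all rows, last column only (same as index 3 here)

print(x.shape, y.shape)
```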
Dealing with missing data
It is not uncommon in real-world applications for our training examples to be
missing one or more values for various reasons.
There could have been an error in the data collection process, certain
measurements may not be applicable, or particular fields could have been simply
left blank in a survey, for example.
We typically see missing values as blank spaces in our data table or as placeholder
strings such as NaN, which stands for "not a number," or NULL (a commonly used
indicator of unknown values in relational databases).
Most computational tools are unable to handle such missing values or will produce
unpredictable results if we simply ignore them.
Therefore, it is crucial that we take care of those missing values before we proceed
with further analyses.
4) Handling Missing data:
The next step of data preprocessing is to handle missing data in the datasets.
If our dataset contains missing data, it may create a serious problem for our
machine learning model.
Ways to handle missing data:
There are mainly two ways to handle missing data:
1. By deleting the particular row:
This is a common way to deal with null values.
We simply delete the specific row or column that contains null values.
However, this approach is not very efficient, and removing data may lead to a loss of
information, which can reduce the accuracy of the output.
2. By calculating the mean:
Here, we calculate the mean of the column or row that contains a missing
value and put it in place of the missing value.
This strategy is useful for features that have numeric data, such as age, salary, year, etc.
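Both approaches can be sketched with pandas. The Age and Salary values are taken from the demo dataset output shown earlier; using DataFrame.dropna and DataFrame.fillna is one possible way to do this, not the only one.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Age": [38, 43, 30, 48, 40, 35, np.nan, 49, 50, 37],
    "Salary": [68000, 45000, 54000, 65000, np.nan,
               58000, 53000, 79000, 88000, 77000],
})

# 1. Delete every row that contains a missing value
dropped = df.dropna()

# 2. Replace each missing value with the mean of its column
filled = df.fillna(df.mean())

print(len(df), len(dropped))
print(filled)
```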
Eliminating training examples or features with missing values
One of the easiest ways to deal with missing data is simply to remove the corresponding features
(columns) or training examples (rows) from the dataset entirely.
Rows with missing values can easily be dropped via the dropna method:
>>> df.dropna(axis=0)
     A    B    C    D
0  1.0  2.0  3.0  4.0
Similarly, we can drop columns that have at least one NaN in any row by setting the axis
argument to 1:
>>> df.dropna(axis=1)
      A     B
0   1.0   2.0
1   5.0   6.0
2  10.0  11.0
Pandas DataFrame.dropna() Syntax
Syntax: DataFrameName.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
Parameter  Value           Description
axis       0 or 'index'    Optional, default 0. Removes ROWS that contain NULL values.
           1 or 'columns'  Removes COLUMNS that contain NULL values.
how        'any' or 'all'  Optional, default 'any'. Specifies whether to remove the row or column
                           when ALL values are NULL, or if ANY value is NULL.
thresh     Number          Optional. Specifies the number of NOT NULL values required to keep the row.
subset     List            Optional. Specifies where to look for NULL values.
inplace    True or False   Optional, default False. If True: the removal is done on the current
                           DataFrame. If False: returns a copy where the removal is done.
Example:
import pandas as pd
df = pd.read_csv('data.csv')
newdf = df.dropna()
The dropna method supports several additional parameters that can come in handy:
# only drop rows where all columns are NaN
# (returns the whole array here since we don't have a row with all values NaN)
>>> df.dropna(how='all')
      A     B     C    D
0   1.0   2.0   3.0  4.0
1   5.0   6.0   NaN  8.0
2  10.0  11.0  12.0  NaN
# drop rows that have fewer than 4 real values
>>> df.dropna(thresh=4)
     A    B    C    D
0  1.0  2.0  3.0  4.0
# only drop rows where NaN appear in specific columns (here: 'C')
>>> df.dropna(subset=['C'])
      A     B     C    D
0   1.0   2.0   3.0  4.0
2  10.0  11.0  12.0  NaN
Although the removal of missing data seems to be a convenient approach, it also comes with certain disadvantages; for example, we may end up removing too many samples, which will make a reliable analysis impossible.
Or, if we remove too many feature columns, we will run the risk of losing valuable information that our classifier needs to discriminate between classes.
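The dropna calls above can be tried end to end with the following minimal sketch. The CSV text is an assumption, chosen only so that the DataFrame matches the outputs printed on the slides (row 1 is missing C, row 2 is missing D); the variable names are illustrative.

```python
from io import StringIO

import pandas as pd

# Hypothetical CSV text chosen to match the outputs shown above:
# row 1 has no value for C, row 2 has no value for D.
csv_data = """A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
10.0,11.0,12.0,
"""
df = pd.read_csv(StringIO(csv_data))

missing_per_column = df.isnull().sum()     # number of NaN values per column
rows_complete = df.dropna(axis=0)          # keep only rows without any NaN
cols_complete = df.dropna(axis=1)          # keep only columns without any NaN
rows_not_all_nan = df.dropna(how="all")    # drop rows where every value is NaN
rows_thresh = df.dropna(thresh=4)          # keep rows with at least 4 non-NaN values
rows_c_present = df.dropna(subset=["C"])   # only check column C for NaN

print(missing_per_column)
print(rows_c_present)
```

Counting the missing values per column first is a cheap way to judge how much data each dropna variant would discard.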
Imputing missing values
Often, the removal of training examples or dropping of entire feature columns is
simply not feasible, because we might lose too much valuable data.
In this case, we can use different interpolation techniques to estimate the missing
values from the other training examples in our dataset.
One of the most common interpolation techniques is mean imputation, where we
simply replace the missing value with the mean value of the entire feature column.
A convenient way to achieve this is by using the SimpleImputer class from scikit-learn
>>> from sklearn.impute import SimpleImputer
>>> import numpy as np
>>> imr = SimpleImputer(missing_values=np.nan, strategy='mean')
>>> imr = imr.fit(df.values)
>>> imputed_data = imr.transform(df.values)
>>> imputed_data
array([[ 1., 2., 3., 4.],
       [ 5., 6., 7.5, 8.],
       [ 10., 11., 12., 6.]])
Here, we replaced each NaN value with the corresponding mean, which is separately calculated for
each feature column.
Other options for the strategy parameter are median or most_frequent, where the latter
replaces the missing values with the most frequent values.
This is useful for imputing categorical feature values, for example, a feature column that stores an
encoding of color names, such as red, green, and blue.
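As a minimal sketch of the most_frequent strategy, the example below imputes a missing entry in a small color column. The data is made up for illustration, and the colors are kept as strings rather than as an integer encoding to keep the example short.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical color feature with one missing entry (np.nan).
colors = np.array([["red"], ["green"], ["red"], [np.nan], ["blue"]], dtype=object)

# most_frequent replaces NaN with the most common value of the column.
imr = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
imputed_colors = imr.fit_transform(colors)

print(imputed_colors.ravel())
```

Mean or median imputation would not be meaningful here, because the feature has no numeric scale.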
Alternatively, an even more convenient way to impute missing values is by using pandas' fillna method and providing an imputation method as an argument.
For example, using pandas, we could achieve the same mean imputation directly in the DataFrame object via the following command:
>>> df.fillna(df.mean())
The fillna() method replaces the NULL values with a specified value.
Syntax
dataframe.fillna(value, method, axis, inplace, limit, downcast)
Parameter   Value                                         Description
value       Number, String, Dictionary, Series, DataFrame Required. Specifies the value to replace the NULL values with. This can also be values for the entire row or column.
method      'backfill', 'bfill', 'pad', 'ffill', None     Optional, default None. Specifies the method to use when replacing.
axis        0, 1, 'index', 'columns'                      Optional, default 0. The axis to fill the NULL values along.
inplace     True, False                                   Optional, default False. If True: the replacing is done on the current DataFrame. If False: returns a copy where the replacing is done.
limit       Number, None                                  Optional, default None. Specifies the maximum number of NULL values to fill (if method is specified).
downcast    Dictionary, None                              Optional. A dictionary of values to fill for specific data types.
Example
import pandas as pd
df = pd.read_csv('data.csv')
newdf = df.fillna(222222)
This replaces NULL values with the number 222222.
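The fillna variants described above can be compared on a small DataFrame. This is a sketch under the assumption that the DataFrame holds the same values as the earlier dropna examples; the fill values 0 and -1.0 are arbitrary. Forward filling is done with DataFrame.ffill() instead of the method argument, which recent pandas releases deprecate.

```python
import numpy as np
import pandas as pd

# Assumed to match the DataFrame used in the dropna examples.
df = pd.DataFrame({
    "A": [1.0, 5.0, 10.0],
    "B": [2.0, 6.0, 11.0],
    "C": [3.0, np.nan, 12.0],
    "D": [4.0, 8.0, np.nan],
})

mean_filled = df.fillna(df.mean())     # column-wise mean imputation
const_filled = df.fillna(0)            # one constant for every NaN
per_column = df.fillna({"C": df["C"].median(), "D": -1.0})  # value per column
forward_filled = df.ffill()            # copy the previous row's value downward

print(mean_filled)
```

All four calls return a new DataFrame and leave df unchanged, because inplace defaults to False.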
The SimpleImputer class belongs to the so-called transformer classes in scikit-learn, which are used for
data transformation.
The two essential methods of those estimators are fit and transform.
The fit method is used to learn the parameters from the training data, and the transform method uses
those parameters to transform the data. Any data array that is to be transformed needs to have the
same number of features as the data array that was used to fit the model.
Figure illustrates how a transformer, fitted on the training data, is used to transform a training dataset as well as a new test
dataset:
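The fit/transform pattern can be sketched as follows. The arrays are made-up illustrative data; the point is that the column means are learned from the training data only and then reused for the test data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical training and test arrays with the same number of features.
X_train = np.array([[1.0, 10.0],
                    [3.0, np.nan],
                    [5.0, 30.0]])
X_test = np.array([[np.nan, 50.0],
                   [7.0, np.nan]])

imr = SimpleImputer(missing_values=np.nan, strategy="mean")
imr.fit(X_train)                       # learn the column means from training data

X_train_imp = imr.transform(X_train)   # fill NaN using the learned means
X_test_imp = imr.transform(X_test)     # reuse the same means for the test data

print(imr.statistics_)                 # the learned per-column means
print(X_test_imp)
```

Fitting on the training data alone keeps information from the test set out of the preprocessing step.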
To handle missing values in the running example, we will use the scikit-learn library, which contains various modules for building machine learning models. Here we will use the SimpleImputer class of the sklearn.impute module (it replaces the Imputer class of sklearn.preprocessing, which was removed in scikit-learn 0.22).
# Handling missing data (replacing missing data with the mean value)
import numpy as np
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
# Fitting the imputer object to the independent variables x
imputer = imputer.fit(x[:, 1:3])
# Replacing missing data with the calculated mean value
x[:, 1:3] = imputer.transform(x[:, 1:3])
The newly created DataFrame contains a nominal feature (color), an ordinal feature (size), and a numerical
feature (price) column. The class labels (assuming that we created a dataset for a supervised learning task)
are stored in the last column.
5) Encoding Categorical data:
Categorical data is data that takes values from a fixed set of categories. In our dataset there are two categorical variables, Country and Purchased.
Machine learning models work entirely on mathematics and numbers, so a categorical variable in the dataset may cause trouble while building the model.
It is therefore necessary to encode these categorical variables into numbers.
For Country variable:
Firstly, we will convert the country names into integer labels.
To do this, we will use the LabelEncoder class from the sklearn.preprocessing module.
#Categorical data
#for Country Variable
from sklearn.preprocessing import LabelEncoder
label_encoder_x = LabelEncoder()
x[:, 0] = label_encoder_x.fit_transform(x[:, 0])
Output:
Out[15]:
array([[2, 38.0, 68000.0],
       [0, 43.0, 45000.0],
       [1, 30.0, 54000.0],
       [0, 48.0, 65000.0],
       [1, 40.0, 65222.22222222222],
       [2, 35.0, 58000.0],
       [1, 41.111111111111114, 53000.0],
       [0, 49.0, 79000.0],
       [2, 50.0, 88000.0],
       [0, 37.0, 77000.0]], dtype=object)
Explanation:
In the above code, we have imported the LabelEncoder class of the sklearn library. This class has encoded the country names as integers.
In our case there are three countries, and as we can see in the above output, they are encoded as 0, 1, and 2. From these values, the machine learning model may assume that there is an ordering or numeric relationship between the countries, which will produce wrong output. To remove this issue, we will use dummy encoding.
Dummy Variables:
Dummy variables are variables that take the value 0 or 1. A value of 1 indicates that the row belongs to the category represented by that column, and the columns for all other categories are 0.
With dummy encoding, we will have a number of columns equal to the number of categories.
In our dataset we have 3 categories, so it will produce three columns containing 0 and 1 values. For dummy encoding, we will use the OneHotEncoder class of the sklearn.preprocessing module.
Output:
Out[17]: array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])
It can also be seen as:
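For the dummy encoding of the Country variable described above, a minimal sketch with OneHotEncoder might look as follows. The country names are hypothetical, since the slides do not list them, and the encoder is applied to a standalone column rather than to the x array of the running example.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical country column (one feature, five rows).
country = np.array([["France"], ["Spain"], ["Germany"], ["Spain"], ["France"]])

ohe = OneHotEncoder()
# fit_transform returns a sparse matrix by default; convert it to a dense array.
dummies = ohe.fit_transform(country).toarray()

print(ohe.categories_[0])   # the categories, in column order
print(dummies)              # one 0/1 column per category
```

Each row contains exactly one 1, so no ordering between the countries is implied. OneHotEncoder also accepts drop='first' to omit one redundant column, which can help with linear models.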
6) Splitting the Dataset into the Training set and Test set
In machine learning data preprocessing, we divide our dataset into a training set and a test set. This is one of the crucial steps of data preprocessing, because it lets us estimate how well the model performs on data it has not seen during training.
Suppose we train our machine learning model on one dataset and then test it on a completely different dataset. The model may then struggle, because the correlations it learned from the training data need not hold in the new data.
Even if we train our model very well and its training accuracy is very high, its performance may drop when we provide it with a new dataset.
So we always try to build a machine learning model that performs well on the training set and also on the test set.
We can define these datasets as:
Training set: A subset of the dataset used to train the machine learning model; the outputs for this subset are already known to the model.
Test set: A subset of the dataset used to test the machine learning model; the model predicts the outputs for this subset, and the predictions are compared with the known values.
For splitting the dataset, use the code:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.2, random_state=0)
Explanation:
• The first line imports train_test_split, which splits arrays of the dataset into random train and test subsets.
• In the second line, we have used four variables for our output:
• x_train: features for the training data
• x_test: features for the testing data
• y_train: dependent variable for the training data
• y_test: dependent variable for the testing data
• In the train_test_split() function, we have passed four parameters: the first two are the arrays of data, and test_size specifies the size of the test set.
• The test_size may be .5, .3, or .2, which gives the fraction of the dataset that is placed in the test set.
• The last parameter, random_state, sets a seed for the random generator so that you always get the same result; a commonly used value is 42.
Output:
By executing the above code, we will get 4 different variables, which can be seen
under the variable explorer section.
As we can see in the above image, the x and y variables are divided into 4 different
variables with corresponding values
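The split can be checked on a small, self-contained example. The arrays below are placeholders (ten rows, two features), not the dataset from the slides; with test_size=0.2, two of the ten rows go to the test set.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 10 samples with 2 features each, plus binary labels.
x = np.arange(20).reshape(10, 2)
y = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0)

# Repeating the call with the same random_state gives the same split.
x_train_again, x_test_again, _, _ = train_test_split(
    x, y, test_size=0.2, random_state=0)

print(x_train.shape, x_test.shape)
print(np.array_equal(x_test, x_test_again))
```

For classification tasks with imbalanced classes, passing stratify=y keeps the class proportions similar in both subsets.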
7) Feature Scaling
Feature scaling is the final step of data
preprocessing in machine learning.
Decision trees and random forests are two of
the very few machine learning algorithms
where we don't need to worry about feature
scaling. Those algorithms are scale invariant.
The majority of machine learning and optimization
algorithms behave much better if features are
on the same scale.
Assume that we have two features where one
feature is measured on a scale from 1 to 10 and
the second feature is measured on a scale from
1 to 100,000. Without scaling, the second feature
would dominate any computation that depends
on the magnitude of the feature values.
Feature scaling is a technique to standardize
the independent variables of the dataset in a
specific range.
In feature scaling, we put our variables in the
same range and on the same scale so that no
variable dominates the others.
Consider the given dataset:
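As we can see, the age and salary column values are not on the same scale. Many
machine learning models (for example, k-nearest neighbors) are based on Euclidean
distance, and if we do not scale the variables, the feature with the larger values
(here, salary) will dominate the distance.
The Euclidean distance between two points $x$ and $y$ with $n$ features is given as:
$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$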
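A small numeric sketch shows why unscaled features distort the Euclidean distance, which is the square root of the summed squared feature differences. The three (age, salary) points are made up for illustration, and min-max scaling is computed by hand from these three points only.

```python
import numpy as np

# Hypothetical people described by (age in years, salary).
a = np.array([25.0, 50000.0])
b = np.array([55.0, 50500.0])   # very different age, similar salary
c = np.array([26.0, 51000.0])   # similar age, somewhat higher salary


def euclidean(p, q):
    """Square root of the sum of squared feature differences."""
    return np.sqrt(np.sum((p - q) ** 2))


# Unscaled: the salary difference dominates, so b looks closer to a than c does.
d_ab = euclidean(a, b)
d_ac = euclidean(a, c)

# Min-max scale each feature to [0, 1] using the three points themselves.
points = np.vstack([a, b, c])
mins, maxs = points.min(axis=0), points.max(axis=0)
scaled = (points - mins) / (maxs - mins)

d_ab_scaled = euclidean(scaled[0], scaled[1])
d_ac_scaled = euclidean(scaled[0], scaled[2])

print(d_ab, d_ac)
print(d_ab_scaled, d_ac_scaled)
```

After scaling, the 30-year age gap between a and b counts as much as the salary gap, and c becomes the nearer neighbor of a.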
Normalization
Normalization refers to the rescaling of the features to a range of [0, 1], which is a special
case of min-max scaling.
The min-max scaling procedure is implemented in scikit-learn and can be used as follows:
>>> from sklearn.preprocessing import MinMaxScaler
>>> mms = MinMaxScaler()
>>> X_train_norm = mms.fit_transform(X_train)
>>> X_test_norm = mms.transform(X_test)
With the default settings of MinMaxScaler, all the variables in the training data are scaled to values between 0 and 1.
Note: Here, we have not scaled the dependent variable because it takes only the two values 0
and 1. If the dependent variable had a wider range of values, we might also need to scale
it.
Other, more advanced methods for feature scaling are available from scikit-learn, such
as the RobustScaler. The RobustScaler is especially helpful and recommended if
we are working with small datasets that contain many outliers.
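The difference between the two scalers can be sketched on a single feature that contains one outlier. The numbers are made up for illustration. MinMaxScaler uses the minimum and maximum, whereas RobustScaler by default centers on the median and divides by the interquartile range.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler

# Hypothetical single feature; the last training value is an outlier.
X_train = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])
X_test = np.array([[2.5], [5.0]])

mms = MinMaxScaler()
X_train_norm = mms.fit_transform(X_train)   # learn min/max on training data
X_test_norm = mms.transform(X_test)         # reuse them for the test data

rbs = RobustScaler()
X_train_rob = rbs.fit_transform(X_train)    # (x - median) / IQR
X_test_rob = rbs.transform(X_test)

print(X_train_norm.ravel())
print(X_train_rob.ravel())
```

With min-max scaling the outlier squeezes the four ordinary values into a narrow band near 0, while the robust scaler keeps them well spread around 0.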
Lasso Regression (Least Absolute Shrinkage and Selection Operator) adds the absolute value of the magnitude of the
coefficients as a penalty term to the loss function.
Sparse Solutions: L1 regularization tends to produce sparse solutions, meaning many of the coefficients are
driven to zero. This is useful for feature selection, as it effectively selects a subset of the most important
features.
Feature Selection: Because L1 regularization can zero out coefficients, it can be used to automatically select
features that are most relevant to the predictive model.
▪ An advantage of L1 regularization is that it is easy to implement and can be trained in one shot: once
it is trained, you can simply use the resulting weight vector.
▪ L1 regularization is often described as robust in dealing with outliers. It creates sparsity in the solution (most of the
coefficients of the solution are zero), which means that the less important features or noise terms receive zero weight.
Sparse solutions with L1 regularization
Since the L1 penalty is the sum of the absolute weight coefficients (remember that the L2 term is
quadratic), we can represent it as a diamond-shaped budget.
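The sparsity produced by the L1 penalty can be observed with scikit-learn's Lasso on synthetic data. Everything in this sketch is an assumption made for illustration: the data is random, only the first three of ten features influence the target, and alpha=0.1 is an arbitrary regularization strength.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: 200 samples, 10 features, only the first 3 are informative.
rng = np.random.RandomState(0)
X = rng.randn(200, 10)
true_w = np.array([3.0, -2.0, 1.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
y = X @ true_w + 0.1 * rng.randn(200)

# alpha controls the strength of the L1 penalty (hypothetical value).
lasso = Lasso(alpha=0.1)
lasso.fit(X, y)

selected = np.flatnonzero(lasso.coef_)          # indices of non-zero weights
n_zero = int(np.sum(lasso.coef_ == 0.0))        # weights driven exactly to zero

print(lasso.coef_)
print(selected, n_zero)
```

Increasing alpha drives more coefficients to zero; the features with non-zero weights are the ones the model effectively selects.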
Ridge regression adds the squared magnitude of the coefficients as a penalty term to the loss function. In the
accompanying figure, the boxed part represents the L2 regularization term.
Small Coefficients: L2 regularization tends to shrink the coefficients, but not necessarily to zero. This means
that while it reduces the impact of less important features, it usually retains all features in the model.
Smooth Solutions: L2 regularization tends to produce smoother models, which can be advantageous in
preventing overfitting when there are many correlated features.
Ridge regression performs better when all the input features influence the output and all the weights are
of roughly equal size.
L2 regularization can learn complex data patterns.
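The shrinking effect of the L2 penalty can be sketched by comparing an unpenalized linear regression with Ridge on the same synthetic data. The data, the true weights, and alpha=10.0 are all illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic data: 50 samples, 5 features, two of them irrelevant.
rng = np.random.RandomState(0)
X = rng.randn(50, 5)
y = X @ np.array([2.0, -1.0, 0.5, 0.0, 0.0]) + 0.5 * rng.randn(50)

ols = LinearRegression().fit(X, y)       # no penalty
ridge = Ridge(alpha=10.0).fit(X, y)      # alpha scales the L2 penalty

# Length of the weight vector with and without the penalty.
ols_norm = np.linalg.norm(ols.coef_)
ridge_norm = np.linalg.norm(ridge.coef_)

print(ols.coef_)
print(ridge.coef_)
print(ols_norm, ridge_norm)
```

The Ridge weights are smaller overall but none of them is exactly zero, which is the contrast with the sparse solutions of L1 regularization.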
A geometric interpretation of L2 regularization
Our goal is to minimize the sum of the unpenalized cost plus the penalty term, which can be understood as
adding bias and preferring a simpler model to reduce the variance in the absence of sufficient training data to
fit the model.
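Written out, the two penalized objectives could take the following form. The notation is an assumption introduced here for illustration: $J(\mathbf{w})$ is the unpenalized cost, $\lambda \ge 0$ is the regularization strength, and $w_1, \dots, w_m$ are the weight coefficients.

```latex
\begin{align}
\text{L1 (Lasso):}\quad & J(\mathbf{w}) + \lambda \lVert \mathbf{w} \rVert_1
  = J(\mathbf{w}) + \lambda \sum_{j=1}^{m} \lvert w_j \rvert \\
\text{L2 (Ridge):}\quad & J(\mathbf{w}) + \lambda \lVert \mathbf{w} \rVert_2^2
  = J(\mathbf{w}) + \lambda \sum_{j=1}^{m} w_j^2
\end{align}
```

A larger $\lambda$ puts more weight on the penalty, pulling the coefficients toward zero and making the model simpler at the cost of added bias.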
Sequential feature selection algorithms
An alternative way to reduce the complexity of the model and avoid overfitting is
dimensionality reduction via feature selection, which is especially useful for
unregularized models.
There are two main categories of dimensionality reduction techniques: feature
selection and feature extraction.
Via feature selection, we select a subset of the original features, whereas in feature
extraction, we derive information from the feature set to construct a new feature
subspace.
Sequential feature selection algorithms are a family of greedy search algorithms that
are used to reduce an initial d-dimensional feature space to a k-dimensional feature
subspace where k<d.
The motivation behind feature selection algorithms is to automatically select a subset
of features that are most relevant to the problem, to improve computational
efficiency, or to reduce the generalization error of the model by removing irrelevant
features or noise, which can be useful for algorithms that don't support
regularization.
A classic sequential feature selection algorithm is sequential backward selection
(SBS), which aims to reduce the dimensionality of the initial feature subspace with a
minimum decay in the performance of the classifier to improve upon computational
efficiency.
In certain cases, SBS can even improve the predictive power of the model if a model
suffers from overfitting.
SBS sequentially removes features from the full feature subset until the new feature
subspace contains the desired number of features.
In order to determine which feature is to be removed at each stage, we need to
define a criterion function, J.
The criterion calculated by the criterion function can simply be the performance of
the classifier after the removal of a particular feature.
Then, the feature to be removed at each stage can simply be defined as the feature
whose removal maximizes this criterion; or in simpler terms, at each stage we eliminate
the feature that causes the least performance loss after removal.
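The greedy removal described above can be sketched in a few lines. This is an illustrative implementation under stated assumptions, not a reference one: the function name sbs, the use of validation accuracy as the criterion, the Iris dataset, and the k-nearest neighbors classifier are all choices made for this example.

```python
from itertools import combinations

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def sbs(estimator, X_train, y_train, X_val, y_val, k_features):
    """Greedy backward selection; returns kept column indices and history."""

    def score(cols):
        # Criterion: validation accuracy using only the given columns.
        estimator.fit(X_train[:, list(cols)], y_train)
        return estimator.score(X_val[:, list(cols)], y_val)

    current = tuple(range(X_train.shape[1]))
    history = [(current, score(current))]
    while len(current) > k_features:
        # Try removing each feature in turn and keep the best remaining subset.
        candidates = list(combinations(current, len(current) - 1))
        scores = [score(c) for c in candidates]
        best = int(np.argmax(scores))
        current = candidates[best]
        history.append((current, scores[best]))
    return current, history


X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

kept, history = sbs(KNeighborsClassifier(n_neighbors=5),
                    X_train, y_train, X_val, y_val, k_features=2)

for cols, acc in history:
    print(cols, round(acc, 3))
```

scikit-learn also provides SequentialFeatureSelector in sklearn.feature_selection, which supports direction='backward' and uses cross-validation for the criterion.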
Sequential Backward Selection (SBS) Algorithm