Machine Learning Unit 1
Topics
1. Intro to ML
2. ML Life Cycle
3. Types of ML
3.1 Supervised and Unsupervised
3.2 Batch and Online
3.3 Instance-based and Model-based
4. Scope and Limitations
5. Challenges Of ML
6. Data Visualization
7. Hypothesis Function and Testing
8. Data Pre-processing
9. Data Augmentation
10. Normalizing Data Set
11. Bias-Variance Tradeoff
12. Relation Between AI, ML, DL and DS
ML Life Cycle
Machine learning gives computer
systems the ability to learn automatically
without being explicitly programmed. But
how does a machine learning system work?
It can be described using the machine
learning life cycle. The machine learning life
cycle is a cyclic process for building an
efficient machine learning project. The main
purpose of the life cycle is to find a solution to
the problem or project.
The machine learning life cycle involves seven
major steps, which are given below:
o Gathering Data
o Data Preparation
o Data Wrangling
o Analyse Data
o Train Model
o Test Model
o Deployment
The most important thing in the complete
process is to understand the problem and to
know its purpose. Therefore, before starting
the life cycle, we need to understand the
problem, because a good result depends on a
good understanding of the problem.
In the complete life cycle, to solve a
problem, we create a machine learning system
called a "model", and this model is created by
"training" it. But to train a model, we
need data; hence, the life cycle starts by
collecting data.
1. Gathering Data:
Data gathering is the first step of the machine
learning life cycle. The goal of this step is to
identify the different data sources and obtain
the data from them, as data can be collected
from various sources such
as files, databases, the internet, or mobile
devices. It is one of the most important steps
of the life cycle: the quantity and quality of
the collected data determine the quality
of the output. In general, the more data we
have, the more accurate the prediction can be.
This step includes the below tasks:
o Identify various data sources
o Collect data
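As a small illustration, collected data often ends up in a flat file. A hypothetical CSV (the column names here are invented) can be read with Python's standard library:

```python
import csv
import io

# a hypothetical CSV standing in for a collected data file
raw = io.StringIO("age,salary,purchased\n25,50000,no\n40,90000,yes\n")
rows = list(csv.DictReader(raw))

print(len(rows), rows[0]["purchased"])  # 2 no
```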
3. Data Wrangling
Data wrangling is the process of cleaning and
converting raw data into a usable format. It
involves cleaning the data, selecting the
variables to use, and transforming the data into
a proper format to make it more suitable for
analysis in the next step. It is one of the most
important steps of the complete process, since
cleaning the data is required to address
quality issues.
The data we have collected is not always
useful, as some of it may be irrelevant.
In real-world applications,
collected data may have various issues,
including:
o Missing Values
o Duplicate data
o Invalid data
o Noise
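A minimal sketch of how such issues might be handled in plain Python (the records here are invented for illustration):

```python
# toy records with a duplicate row and a missing value
records = [
    {"age": 25, "salary": 50000},
    {"age": 25, "salary": 50000},   # exact duplicate
    {"age": None, "salary": 90000}, # missing age
]

# drop exact duplicates while preserving order
seen, clean = set(), []
for r in records:
    key = tuple(sorted(r.items()))
    if key not in seen:
        seen.add(key)
        clean.append(dict(r))

# fill missing ages with the mean of the known ages
known = [r["age"] for r in clean if r["age"] is not None]
mean_age = sum(known) / len(known)
for r in clean:
    if r["age"] is None:
        r["age"] = mean_age

print(clean)  # [{'age': 25, 'salary': 50000}, {'age': 25.0, 'salary': 90000}]
```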
5. Train Model
The next step is to train the model. In this
step, we train the model on the prepared
dataset using various machine learning
algorithms, so that it improves its performance
on the problem. Training is required so that
the model can learn the various patterns,
rules, and features in the data.
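To make the idea concrete, here is a deliberately tiny sketch of "training": a nearest-centroid style model that just learns the mean feature value per class. The data is invented; real training uses far richer algorithms.

```python
# toy labelled dataset: (feature value, class label)
data = [(1, "A"), (2, "A"), (4, "B"), (6, "B")]

# "training" here means computing the mean feature value per class
grouped = {}
for x, label in data:
    grouped.setdefault(label, []).append(x)
model = {label: sum(xs) / len(xs) for label, xs in grouped.items()}

print(model)  # {'A': 1.5, 'B': 5.0}
```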
6. Test Model
Once our machine learning model has been
trained on a given dataset, we test the
model. In this step, we check the accuracy
of the model by providing a test dataset to it.
Testing the model determines the percentage
accuracy of the model as per the requirements
of the project or problem.
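Accuracy as described here is just the fraction of test examples predicted correctly; a minimal sketch with made-up labels:

```python
# hypothetical ground-truth labels and model predictions on a test set
y_true = ["A", "B", "A", "B", "A"]
y_pred = ["A", "B", "B", "B", "A"]

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

print(f"accuracy = {accuracy:.0%}")  # accuracy = 80%
```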
7. Deployment
The last step of the machine learning life cycle
is deployment, where we deploy the model in a
real-world system.
If the prepared model produces accurate
results as per our requirements at an
acceptable speed, we deploy the model in
the real system. But before deploying the
project, we check whether its performance
holds up on the available data.
The deployment phase is similar to making the
final report for a project.
Types of ML
Machine learning is a subset of AI which
enables a machine to automatically learn
from data, improve its performance from past
experience, and make predictions. Machine
learning consists of a set of algorithms that work
on huge amounts of data. Data is fed to these
algorithms to train them, and on the basis of this
training, they build a model and perform a specific
task.
Supervised Machine Learning
Supervised learning is the type of machine
learning in which machines are trained using
well-"labelled" training data, and on the basis of
that data, machines predict the output.
Labelled data means input data that is already
tagged with the correct output.
In supervised learning, the training data
provided to the machine works as a
supervisor that teaches the machine to predict
the output correctly. It applies the same
concept as a student learning under the
supervision of a teacher.
Supervised learning is a process of providing
input data as well as correct output data to the
machine learning model. The aim of a
supervised learning algorithm is to find a
mapping function to map the input
variable(x) with the output variable(y).
Supervised learning: Supervised
learning is the learning of a model
from an input variable (say, X) and
an output variable (say, Y), using an
algorithm to map the input to the
output. That is, Y = f(X).
1. Regression
Regression algorithms are used if there is a
relationship between the input variable and the
output variable. It is used for the prediction of
continuous variables, such as Weather
forecasting, Market Trends, etc. Below are
some popular Regression algorithms which
come under supervised learning:
o Linear Regression
o Regression Trees
o Non-Linear Regression
o Polynomial Regression
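As a sketch of the simplest case, ordinary least-squares linear regression can be computed directly; the points here are invented and lie exactly on y = 2x + 1:

```python
# toy points lying exactly on y = 2x + 1
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n

# least-squares estimates of slope and intercept
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

print(slope, intercept)  # 2.0 1.0
```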
2. Classification
Classification algorithms are used when the
output variable is categorical, meaning
there are two classes such as Yes-No, Male-
Female, True-False, etc., as in spam filtering.
Below are some popular classification
algorithms which come under supervised
learning:
o Random Forest
o Decision Trees
o Logistic Regression
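As a toy stand-in for a learned spam classifier (the keyword list is invented; a real model would learn its decision rule from labelled data):

```python
# invented keyword set standing in for learned spam indicators
spam_words = {"winner", "free", "prize"}

def classify(message):
    # predict one of two classes based on keyword overlap
    words = set(message.lower().split())
    return "spam" if words & spam_words else "not spam"

print(classify("claim your free prize now"))  # spam
print(classify("meeting moved to tuesday"))   # not spam
```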
Advantages of
Supervised learning:
o With the help of supervised learning, the
model can predict the output on the basis
of prior experiences.
o In supervised learning, we can have an
exact idea about the classes of objects.
o Supervised learning models help us to
solve various real-world problems such
as fraud detection, spam filtering, etc.
Disadvantages of
supervised learning:
o Supervised learning models are not
suitable for handling complex tasks.
o Supervised learning cannot predict the
correct output if the test data is different
from the training dataset.
o Training requires a lot of computation
time.
o In supervised learning, we need enough
knowledge about the classes of objects.
Unsupervised Machine Learning
In the previous topic, we learned about supervised
machine learning, in which models are trained
using labelled data. But there may be many cases
in which we do not have labelled data and need to
find the hidden patterns in a given
dataset. To solve such cases in
machine learning, we need unsupervised
learning techniques.
Unsupervised
Learning: Unsupervised learning is
where only the input data (say, X) is
present and no corresponding output
variable exists.
What is Unsupervised
Learning?
As the name suggests, unsupervised learning
is a machine learning technique in which
models are not supervised using a training
dataset. Instead, the model itself finds the hidden
patterns and insights in the given data. It
can be compared to the learning which takes place
in the human brain while learning new things.
It can be defined as:
Unsupervised learning is a type of machine
learning in which models are trained using an
unlabelled dataset and are allowed to act on
that data without any supervision.
Unsupervised learning cannot be directly
applied to a regression or classification
problem because unlike supervised learning,
we have the input data but no corresponding
output data. The goal of unsupervised learning
is to find the underlying structure of
dataset, group that data according to
similarities, and represent that dataset in a
compressed format.
Example: Suppose the unsupervised learning
algorithm is given an input dataset containing
images of different types of cats and dogs. The
algorithm is never trained upon the given
dataset, which means it does not have any idea
about the features of the dataset. The task of
the unsupervised learning algorithm is to
identify the image features on their own.
Unsupervised learning algorithm will perform
this task by clustering the image dataset into
the groups according to similarities between
images.
Why use Unsupervised
Learning?
Below are some main reasons which describe
the importance of Unsupervised Learning:
o Unsupervised learning is helpful for
finding useful insights from the data.
o Unsupervised learning is much closer to how
a human learns to think through their own
experiences, which makes it closer to real AI.
Unsupervised
Learning algorithms:
Below is the list of some popular unsupervised
learning algorithms:
o K-means clustering
o Hierarchical clustering
o Anomaly detection
o Neural Networks
o Apriori algorithm
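A minimal 1-D K-means sketch (two clusters, invented points, naive initialisation) illustrates the grouping-by-similarity idea:

```python
# invented 1-D points forming two obvious groups
points = [1.0, 1.5, 2.0, 10.0, 10.5, 11.0]
centers = [points[0], points[-1]]  # naive initialisation

for _ in range(5):
    # assign each point to its nearest centre
    clusters = [[], []]
    for p in points:
        i = 0 if abs(p - centers[0]) <= abs(p - centers[1]) else 1
        clusters[i].append(p)
    # move each centre to the mean of its cluster
    centers = [sum(c) / len(c) for c in clusters]

print(centers)  # [1.5, 10.5]
```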
What is instance-based
learning & how does it work?
Instance-based learning(also known
as memory-based learning or lazy learning)
involves memorizing training data in order to
make predictions about future data points.
This approach doesn’t require any prior
knowledge or assumptions about the data,
which makes it easy to implement and
understand. However, it can be
computationally expensive since all of the
training data needs to be stored in memory
before making a prediction. Additionally, this
approach doesn’t generalize well to unseen
data sets because its predictions are based on
memorized examples rather than learned
models.
In instance-based learning, the system learns
the training data by heart. At prediction
time, the system uses a similarity
measure and compares the new cases with the
learned data. K-nearest neighbours (KNN) is an
algorithm that belongs to the instance-based
class of algorithms. KNN is a non-
parametric algorithm because it does not
assume any specific form or underlying
structure in the data. Instead, it relies on a
measure of similarity between each pair of
data points. Generally speaking, this measure
is based on either Euclidean distance or cosine
similarity; however, other metrics can
be used depending on the type of data being
analysed. Once the similarity between two
points is calculated, KNN looks at the
neighbours within a certain radius around
that point and uses these neighbours as
examples to make its prediction. This means
that instead of creating a generalizable model
from all of the data, KNN looks for
similarities among individual data
points and makes predictions
accordingly. For example, a new instance
will be predicted as a triangle when the
greater number of its nearest neighbours
are triangles.
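The majority-vote idea can be sketched in a few lines (the 2-D points and labels below are invented):

```python
import math
from collections import Counter

# invented 2-D training examples: (point, label)
train = [((1, 1), "triangle"), ((1, 2), "triangle"),
         ((2, 1), "triangle"), ((8, 8), "square"), ((8, 9), "square")]

def knn_predict(point, k=3):
    # sort the memorised examples by Euclidean distance to the query
    by_dist = sorted(train, key=lambda ex: math.dist(point, ex[0]))
    # majority vote among the k nearest neighbours
    votes = Counter(label for _, label in by_dist[:k])
    return votes.most_common(1)[0][0]

print(knn_predict((2, 2)))  # triangle
```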
In addition to providing accurate predictions,
one major advantage of using KNN over other
supervised learning algorithms is its
versatility; KNN can be used with both
numeric datasets, such as when predicting
house prices, and categorical datasets, such
as when predicting whether a website visitor
will purchase a product or not. Furthermore,
KNN has no learned parameters to fit, since
it does not assume any underlying structure
in the data; its behaviour depends only on
the choice of k and on how close two points
must be to be considered similar.
Because KNN is an instance-based learning
algorithm, it is not suitable for very large
datasets. This is because the model has to store
all of the training examples in memory, and
making predictions on new data points
involves comparing the new point to all of the
stored training examples. However, for small
or medium-sized datasets, KNN can be a very
effective learning algorithm.
Other instance-based learning algorithms
include learning vector quantization
(LVQ) and self-organizing maps (SOMs).
These algorithms also memorize the training
examples and use them to make predictions on
new data, but they use different techniques to
do so.
Advantages:
1. Instead of estimating the target function
for the entire instance set, local
approximations can be made to it.
2. The algorithm can adapt easily to new
data that is collected as we go.
Disadvantages:
1. Classification costs are high.
2. A large amount of memory is required to
store the data, and each query involves
building a local model from scratch.
Some of the instance-based learning
algorithms are :
1. K Nearest Neighbor (KNN)
2. Self-Organizing Map (SOM)
3. Learning Vector Quantization (LVQ)
4. Locally Weighted Learning (LWL)
5. Case-Based Reasoning