
Unit I

Introduction to Machine Learning
KUNAL AHIRE
What is Machine Learning?
 Originally developed as a subfield of Artificial Intelligence (AI), one of the goals
behind machine learning was to replace the need for developing computer programs
“manually."
 Considering that programs are being developed to automate processes, we can think
of machine learning as the process of “automating automation.”
 In other words, machine learning lets computers “create" programs (often, the intent
for developing these programs is making predictions) themselves.
 In other words, machine learning is the process of turning data into programs.
 Machine learning is the field of study that gives computers the ability to learn
without being explicitly programmed.
Machine Learning Vs Classic
Programming
Tom Mitchell's description
A computer program is said to learn from experience E with respect to some
class of tasks T and performance measure P, if its performance at tasks in T, as
measured by P, improves with experience E - Tom Mitchell, Machine Learning
Professor at Carnegie Mellon University
To illustrate this quote with an example, consider the problem of recognizing
handwritten digits:
Task T: classifying handwritten digits from images
Performance measure P: percentage of digits classified correctly
Training experience E: dataset of digits given classifications, e.g., MNIST
Tom Mitchell's description
Classic and Adaptive Machines
 Since time immemorial, human beings have built tools and machines to
simplify their work and reduce the overall effort needed to complete many
different tasks.
 A machine is immediately considered useful and destined to be continuously
improved if its users can easily understand what tasks can be completed with
less effort or completely automatically.
In the latter case, some intelligence seems to appear next to cogs, wheels, or
axles. So a further step can be added to our evolution list: automatic machines,
built (nowadays we'd say programmed) to accomplish specific goals by
transforming energy into work.
Classic and Adaptive Machines
In the following figure, there's a generic representation of a classical system
that receives some input values, processes them, and produces output results:
Classic and Adaptive Machines
Programmable computers are widespread, flexible, and more and more
powerful instruments; moreover, the diffusion of the internet allowed us to
share software applications and related information with minimal effort.
The word-processing software that I'm using, my email client, a web browser,
and many other common tools running on the same machine are all examples
of such flexibility.
It's undeniable that the IT revolution dramatically changed our lives and
sometimes improved our daily jobs, but without machine learning (and all its
applications), there are still many tasks that remain far outside the computer's domain.
Classic and Adaptive Machines
Spam filtering, Natural Language Processing, visual tracking with a webcam or
a smartphone, and predictive analysis are only a few applications that
revolutionized human-machine interaction and increased our expectations.
In many cases, they transformed our electronic tools into actual cognitive
extensions that are changing the way we interact with many daily situations.
They achieved this goal by filling the gap between human perception,
language, reasoning, and models on the one hand and artificial instruments on the other.
Classic and Adaptive Machines
Here's a schematic representation of an adaptive system:
Machine Learning Cycle
 Problem Understanding
 Data Collection
 Data Preprocessing
 Model Selection
 Model Building
 Model Evaluation
 Model Tuning
 Deployment
 Monitoring
Applications of Machine Learning
After the field of machine learning was "founded" more than half a century ago, we
can now find applications of machine learning in almost every aspect of our lives.
Popular applications of machine learning include the following:
 Email spam detection
 Face detection and matching (e.g., iPhone X)
 Web search (e.g., DuckDuckGo, Bing, Google)
 Sports predictions
 Post office (e.g., sorting letters by zip codes)
ATMs (e.g., reading checks)
Applications of Machine Learning
Credit card fraud
 Stock predictions
 Smart assistants (Apple Siri, Amazon Alexa, . . . )
 Product recommendations (e.g., Netflix, Amazon)
 Self-driving cars (e.g., Uber, Tesla)
 Language translation (Google Translate)
 Sentiment analysis
 Drug design
 Medical diagnoses
Types of machine learning
algorithms
Regardless of whether the learner is a human or machine, the basic learning process is similar.
It can be divided into four interrelated components:
 Data storage utilizes observation, memory, and recall to provide a factual basis for further
reasoning.
 Abstraction involves the translation of stored data into broader representations and concepts.
 Generalization uses abstracted data to create knowledge and inferences that drive action in
new contexts.
 Evaluation provides a feedback mechanism to measure the utility of learned knowledge and
inform potential improvements.
 Machine learning algorithms are divided into categories according to their purpose.
Types of machine learning
algorithms
Main categories are
• Supervised learning (predictive model, "labeled" data)
• classification (Logistic Regression, Decision Tree, KNN, Random Forest, SVM, Naive Bayes, etc)
• numeric prediction (Linear Regression, KNN, Gradient Boosting & AdaBoost, etc)

• Unsupervised learning (descriptive model, "unlabeled" data)


• clustering (K-Means)
• pattern discovery

• Semi-supervised learning (mixture of "labeled" and "unlabeled" data).

• Reinforcement learning. Using this algorithm, the machine is trained to make specific decisions. It works
this way: the machine is exposed to an environment where it trains itself continually using trial and error.
This machine learns from past experience and tries to capture the best possible knowledge to make
accurate business decisions. Example of Reinforcement Learning: Markov Decision Process.
Types of machine learning
algorithms
Supervised Machine Learning
Supervised Machine Learning
 Supervised learning means that a model is trained on a "labelled dataset".
Labelled datasets contain both input and output parameters.
 In supervised learning, algorithms learn to map inputs to the correct outputs.
 Both the training and validation datasets are labelled.
Supervised Machine Learning
Supervised Machine Learning
Example:
 Consider a scenario where you have to build an image classifier to differentiate
between cats and dogs.
 If you feed the datasets of dogs and cats’ labelled images to the algorithm, the
machine will learn to classify between a dog or a cat from these labeled images.
When we input new dog or cat images that it has never seen before, it will use
what it has learned and predict whether the image shows a dog or a cat.
This is how supervised learning works; this particular example is image
classification.
Supervised Machine Learning
There are two main categories of supervised learning that are mentioned
below:
 Classification
 Regression
Supervised Machine Learning
Classification
 Classification deals with predicting categorical target variables, which
represent discrete classes or labels.
 For instance, classifying emails as spam or not spam, or predicting
whether a patient has a high risk of heart disease.
 Classification algorithms learn to map the input features to one of the
predefined classes.
Supervised Machine Learning
Here are some classification algorithms:
 Logistic Regression
 Support Vector Machine
 Random Forest
 Decision Tree
 K-Nearest Neighbors (KNN)
 Naive Bayes
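To make the classification workflow concrete, here is a minimal sketch (not from the slides) that trains one of the algorithms listed above, logistic regression, on a small synthetic labelled dataset; the dataset and all parameter values are illustrative assumptions.

```python
# Minimal sketch: supervised classification with logistic regression.
# The synthetic features stand in for e.g. the cat/dog image features above.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)     # learn the input -> class mapping
print("test accuracy:", clf.score(X_test, y_test))   # fraction classified correctly
```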
Supervised Machine Learning
Regression
 Regression, on the other hand, deals with predicting continuous target
variables, which represent numerical values.
 For example, predicting the price of a house based on its size, location,
and amenities, or forecasting the sales of a product.
 Regression algorithms learn to map the input features to a continuous
numerical value.
Supervised Machine Learning
Here are some regression algorithms:
 Linear Regression
 Polynomial Regression
 Ridge Regression
 Lasso Regression
 Decision tree
 Random Forest
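Similarly, a minimal sketch of regression using the first algorithm in the list, linear regression; the house sizes and prices below are made-up values for illustration only.

```python
# Minimal sketch: predicting a continuous target (price) from a feature (size).
import numpy as np
from sklearn.linear_model import LinearRegression

size = np.array([[50.0], [80.0], [120.0], [160.0]])   # hypothetical house sizes (m^2)
price = np.array([100.0, 150.0, 230.0, 300.0])        # hypothetical prices (thousands)

reg = LinearRegression().fit(size, price)             # learn a continuous mapping
print(reg.predict([[100.0]]))                         # predicted price for a 100 m^2 house
```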
Advantages of Supervised
Machine Learning
 Supervised learning models can have high accuracy as they are trained
on labelled data.
 The process of decision-making in supervised learning models is often
interpretable.
 Pre-trained supervised models can often be reused, which saves time and
resources compared with developing new models from scratch.
Disadvantages of Supervised
Machine Learning
 It is limited to the patterns present in the training data and may struggle
with unseen or unexpected patterns.
 It can be time-consuming and costly, as it relies entirely on labeled data.
 It may generalize poorly to new data.
Applications of Supervised
Learning
• Image classification: Identify objects, faces, and other features in images.
• Natural language processing: Extract information from text, such as sentiment, entities, and relationships.
• Speech recognition: Convert spoken language into text.
• Recommendation systems: Make personalized recommendations to users.
• Predictive analytics: Predict outcomes, such as sales, customer churn, and stock prices.
• Medical diagnosis: Detect diseases and other medical conditions.
• Fraud detection: Identify fraudulent transactions.
• Autonomous vehicles: Recognize and respond to objects in the environment.
• Email spam detection: Classify emails as spam or not spam.
• Quality control in manufacturing: Inspect products for defects.
• Credit scoring: Assess the risk of a borrower defaulting on a loan.
• Gaming: Recognize characters, analyze player behavior, and create NPCs.
• Customer support: Automate customer support tasks.
• Weather forecasting: Make predictions for temperature, precipitation, and other meteorological parameters.
• Sports analytics: Analyze player performance.
Unsupervised Machine Learning
Unsupervised Learning
 Unsupervised learning is a type of machine learning technique in which an
algorithm discovers patterns and relationships using unlabeled data.
 Unlike supervised learning, unsupervised learning doesn’t involve
providing the algorithm with labeled target outputs.
 The primary goal of Unsupervised learning is often to discover hidden
patterns, similarities, or clusters within the data, which can then be used for
various purposes, such as data exploration, visualization, dimensionality
reduction, and more.
Unsupervised Machine Learning
Unsupervised Machine Learning
Example
 Consider that you have a dataset that contains information about the
purchases you and other customers made from a shop.
 Through clustering, the algorithm can group customers with similar
purchasing behaviour, revealing potential customer segments without
predefined labels.
 This type of information can help businesses target customers as
well as identify outliers.
Unsupervised Machine Learning
Two main categories of unsupervised learning are mentioned
below:
 Clustering
 Association
Unsupervised Machine Learning
Clustering
 Clustering is the process of grouping data points into clusters based on their similarity. This
technique is useful for identifying patterns and relationships in data without the need for
labeled examples.
Here are some clustering algorithms:
 K-Means Clustering algorithm
 Mean-shift algorithm
 DBSCAN Algorithm
 Principal Component Analysis
 Independent Component Analysis
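As a minimal sketch of the clustering idea, here is K-Means (the first algorithm listed above) grouping made-up customer purchase data into two clusters; the feature values and the choice of two clusters are assumptions for illustration.

```python
# Minimal sketch: unsupervised clustering of unlabeled customer data.
import numpy as np
from sklearn.cluster import KMeans

# Each row: [number of purchases, average spend] for one customer (invented values).
X = np.array([[2, 10], [3, 12], [25, 200], [30, 220], [4, 15], [28, 210]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # cluster index assigned to each customer
print(km.cluster_centers_)  # centre of each discovered group
```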
Unsupervised Machine Learning
Association
 Association rule learning is a technique for discovering relationships between items
in a dataset.
It identifies rules that indicate the presence of one item implies the presence of
another item with a specific probability.
Here are some association rule learning algorithms:
 Apriori Algorithm
 Eclat
 FP-growth Algorithm
Advantages of Unsupervised
Machine Learning
 It helps to discover hidden patterns and various relationships between
the data.
 Used for tasks such as customer segmentation, anomaly detection, and
data exploration.
 It does not require labeled data and reduces the effort of data labeling.
Disadvantages of Unsupervised
Machine Learning
 Without using labels, it may be difficult to predict the quality of the
model’s output.
 Cluster Interpretability may not be clear and may not have meaningful
interpretations.
 It has techniques such as autoencoders and dimensionality reduction
that can be used to extract meaningful features from raw data.
Applications of Unsupervised
Learning
• Clustering: Group similar data points into clusters.
• Anomaly detection: Identify outliers or anomalies in data.
• Dimensionality reduction: Reduce the dimensionality of data while preserving its essential information.
• Recommendation systems: Suggest products, movies, or content to users based on their historical behavior or preferences.
• Topic modeling: Discover latent topics within a collection of documents.
• Density estimation: Estimate the probability density function of data.
• Image and video compression: Reduce the amount of storage required for multimedia content.
• Exploratory data analysis (EDA): Explore data and gain insights before defining specific tasks.
• Data preprocessing: Help with data preprocessing tasks such as data cleaning, imputation of missing values, and data scaling.
• Market basket analysis: Discover associations between products.
• Genomic data analysis: Identify patterns or group genes with similar expression profiles.
• Image segmentation: Segment images into meaningful regions.
• Community detection in social networks: Identify communities or groups of individuals with similar interests or connections.
• Customer behavior analysis: Uncover patterns and insights for better marketing and product recommendations.
• Content recommendation: Classify and tag content to recommend it to users.
Semi-Supervised Learning
 Semi-supervised learning is a machine learning approach that sits between
supervised and unsupervised learning, so it uses both labelled and
unlabelled data.
 It's particularly useful when obtaining labeled data is costly, time-consuming,
or resource-intensive.
 This approach is useful when labelling the entire dataset would be expensive
and time-consuming.
 Semi-supervised learning is chosen when labelling the data requires skills and
relevant resources in order to train or learn from it.
Semi-Supervised Learning
 We use these techniques when only a small portion of the data is labeled
and the large remaining portion is unlabeled.
 We can use unsupervised techniques to predict labels for the unlabeled data
and then feed these labels to supervised techniques, as sketched below.
 This technique is mostly applicable to image datasets, where usually not all
images are labeled.
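Here is a minimal sketch of that idea, a simple pseudo-labelling variant of semi-supervised learning: train on the small labelled portion, predict labels for the unlabelled portion, then retrain on both. The data is synthetic and the 30-example labelled subset is an arbitrary assumption.

```python
# Minimal sketch: pseudo-labelling with a small labelled set and a large unlabelled set.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_labeled, y_labeled = X[:30], y[:30]       # small labelled subset
X_unlabeled = X[30:]                        # large unlabelled subset (labels hidden)

base = LogisticRegression().fit(X_labeled, y_labeled)
pseudo_labels = base.predict(X_unlabeled)   # guess labels for the unlabelled data

X_all = np.vstack([X_labeled, X_unlabeled])
y_all = np.concatenate([y_labeled, pseudo_labels])
final = LogisticRegression().fit(X_all, y_all)   # retrain on labelled + pseudo-labelled data
```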
Semi-Supervised Learning
Semi-Supervised Learning
Example:
Consider that we are building a language translation model; obtaining labeled
translations for every sentence pair can be resource-intensive.
Semi-supervised learning allows the model to learn from both labeled and
unlabeled sentence pairs, making it more accurate.
This technique has led to significant improvements in the quality of
machine translation services.
Advantages of Semi- Supervised
Machine Learning
 It leads to better generalization as compared to supervised learning, as
it takes both labeled and unlabeled data.
 Can be applied to a wide range of data.
Disadvantages of Semi-
Supervised Machine Learning
 Semi-supervised methods can be more complex to implement
compared to other approaches.
 It still requires some labeled data, which might not always be available or
easy to obtain.
 Unrepresentative or noisy unlabeled data can negatively impact the model's performance.
Applications of Semi-Supervised
Learning
 Image Classification and Object Recognition: Improve the accuracy of models by
combining a small set of labeled images with a larger set of unlabeled images.
 Natural Language Processing (NLP): Enhance the performance of language models and
classifiers by combining a small set of labeled text data with a vast amount of unlabeled text.
 Speech Recognition: Improve the accuracy of speech recognition by leveraging a limited
amount of transcribed speech data and a more extensive set of unlabeled audio.
 Recommendation Systems: Improve the accuracy of personalized recommendations by
supplementing a sparse set of user-item interactions (labeled data) with a wealth of unlabeled
user behavior data.
 Healthcare and Medical Imaging: Enhance medical image analysis by utilizing a small set of
labeled medical images alongside a larger set of unlabeled images.
Reinforcement Machine Learning
 A reinforcement learning algorithm is a learning method that interacts with an
environment by producing actions and discovering errors.
 Trial and error and delayed reward are the most relevant characteristics of reinforcement learning.
 In this technique, the model keeps improving its performance using reward feedback to learn
the desired behavior or pattern.
 These algorithms are tailored to a specific problem, e.g. the Google self-driving car, or AlphaGo,
where a bot competes with humans and even with itself to get better and better at playing Go.
 Each time we feed in data, the agent learns and adds that data to its knowledge, which becomes
its training experience.
 So, the more it learns, the better trained and hence the more experienced it becomes.
Reinforcement Machine Learning
Here are some of the most common reinforcement learning algorithms:
 Q-learning: Q-learning is a model-free RL algorithm that learns a Q-function, which
maps states to actions. The Q-function estimates the expected reward of taking a
particular action in a given state.
 SARSA (State-Action-Reward-State-Action): SARSA is another model-free RL
algorithm that learns a Q-function. However, unlike Q-learning, SARSA updates the Q-
function for the action that was actually taken, rather than the optimal action.
 Deep Q-learning: Deep Q-learning is a combination of Q-learning and deep learning.
Deep Q-learning uses a neural network to represent the Q-function, which allows it to
learn complex relationships between states and actions.
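As a minimal sketch of tabular Q-learning, the update rule described above can be written in a few lines of NumPy; the 5-state chain environment, learning rate, and discount factor below are invented purely for illustration.

```python
# Minimal sketch: tabular Q-learning on a toy 5-state chain environment.
import numpy as np

n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.9            # learning rate and discount factor (assumed values)

def step(state, action):
    """Hypothetical environment: action 1 moves right, reward only at the last state."""
    next_state = min(state + 1, n_states - 1) if action == 1 else max(state - 1, 0)
    reward = 1.0 if next_state == n_states - 1 else 0.0
    return next_state, reward

rng = np.random.default_rng(0)
for _ in range(1000):
    s = int(rng.integers(n_states))
    a = int(rng.integers(n_actions))               # explore with random actions
    s_next, r = step(s, a)
    # Q-learning update: move Q(s, a) toward r + gamma * max_a' Q(s', a')
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])

print(Q)   # learned state-action values
```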
Reinforcement Machine Learning
Reinforcement Machine Learning
Example:
 Consider that you are training an AI agent to play a game like chess.
 The agent explores different moves and receives positive or negative feedback based
on the outcome.
 Reinforcement learning also finds applications in robotics and similar settings, where
agents learn to perform tasks by interacting with their surroundings.
Types of Reinforcement Machine Learning
Positive reinforcement
 Rewards the agent for taking a desired action.

 Encourages the agent to repeat the behavior.

Examples: Giving a treat to a dog for sitting, providing a point in a game for a correct answer.

Negative reinforcement
 Removes an undesirable stimulus when the agent performs a desired behavior.
 This also encourages the agent to repeat the desired behavior.
Examples: Turning off a loud buzzer when a lever is pressed, avoiding a penalty by completing
a task.
Advantages of Reinforcement
Machine Learning
 It enables autonomous decision-making and is well-suited for tasks that require
learning a sequence of decisions, such as robotics and game playing.
 This technique is preferred for achieving long-term results that are otherwise very
difficult to achieve.
 It can be used to solve complex problems that cannot be solved by conventional
techniques.
Disadvantages of Reinforcement
Machine Learning
 Training Reinforcement Learning agents can be computationally expensive and
time-consuming.
 Reinforcement learning is not preferable to solving simple problems.
 It needs a lot of data and a lot of computation, which makes it impractical and
costly.
Applications of Reinforcement
Machine Learning
• Game Playing: RL can teach agents to play games, even complex ones.
• Robotics: RL can teach robots to perform tasks autonomously.
• Autonomous Vehicles: RL can help self-driving cars navigate and make decisions.
• Recommendation Systems: RL can enhance recommendation algorithms by learning user preferences.
• Healthcare: RL can be used to optimize treatment plans and drug discovery.
• Natural Language Processing (NLP): RL can be used in dialogue systems and chatbots.
• Finance and Trading: RL can be used for algorithmic trading.
• Agriculture: RL can be used to optimize agricultural operations.
• Supply Chain and Inventory Management: RL can be used to optimize supply chain operations.
• Energy Management: RL can be used to optimize energy consumption.
• Game AI: RL can be used to create more intelligent and adaptive NPCs in video games.
• Adaptive Personal Assistants: RL can be used to improve personal assistants.
• Virtual Reality (VR) and Augmented Reality (AR): RL can be used to create immersive and interactive experiences.
• Industrial Control: RL can be used to optimize industrial processes.
• Education: RL can be used to create adaptive learning systems.
Important Elements of Machine
Learning
There are six elements of machine learning :

1. Data

2. Defining a Task

3. Applying Model

4. Calculating Loss

5. Learning Algorithm

6. Evaluation
Important Elements of Machine
Learning
Data (The fossil fuel of machine learning):
 Data means information. All types and formats of information.
 Today, there is an enormous amount of data produced every second, which can be used
to answer so many questions.
 There is text data as well as audio-video data, there is structured data as well as
unstructured data.
 One important thing to remember is that it doesn't matter in which format you get the
data, at the end all the data needs to be encoded as numbers before feeding it to the
computers.
Important Elements of Machine
Learning
 A typical dataset used for ML prediction is high-dimensional: it consists of
millions of rows (data points/observations) with typically thousands or even
millions of columns (parameters/features).
 If a dataset contains inputs along with their corresponding outputs, it is ideal
for supervised learning, which learns the relationship between input and output;
if the dataset doesn't contain outputs corresponding to the inputs, we can only
perform an unsupervised learning task.
Important Elements of Machine
Learning
 Data can be structured (represented in tabular form, e.g. sales data or file
records) or unstructured (e.g. an incoming feed on a social media website).
Important Elements of Machine
Learning
The data that has to be fed to the model for training should be in
machine-readable form, i.e. it should be encoded as numbers.
For example (a minimal sketch of the first case follows below):
1. Text data such as reviews can be represented in numerical format using one-hot encoding.
2. Image data can be represented in RGB format.
3. Video (a collection of frames) can be represented in numerical format.
4. Speech data can be represented in numerical format using variation in amplitude.
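A minimal sketch of the first point, turning categorical text values into numbers with one-hot encoding via pandas; the sentiment labels are made-up example values.

```python
# Minimal sketch: one-hot encoding categorical text values as numbers.
import pandas as pd

sentiments = pd.Series(["good", "bad", "average", "good"])   # hypothetical review labels
print(pd.get_dummies(sentiments))                            # one 0/1 column per category
```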
Important Elements of Machine
Learning
Task (setting an objective of the ML project with the curated dataset):
 Based on the procured/curated dataset, we can define our task accordingly.
 If we have a labeled training dataset, i.e. it contains inputs (x's) and their
corresponding labels (y's), we can perform supervised learning
(classification/regression); if the training dataset doesn't contain labels (y's)
corresponding to the inputs, we can only perform unsupervised learning
(clustering/generation).
Important Elements of Machine
Learning

(Figure: different tasks in machine learning.)
Important Elements of Machine
Learning
Model (mathematical formulation of a task)
 After the task is assigned, various models can be proposed.
 We then need to find which model is appropriate: a model's prediction is
nothing but an estimate of f(x) in the unknown relation y = f(x).
 f(x) can be any function.
Important Elements of Machine
Learning
Some of the complex models used for training are :
Sigmoid Neuron (1/(1+e^-x))
Feed Forward Neural Network (FFN)
Convolutional Neural Network (CNN)
Recurrent Neural Network (RNN)
Long Short Term Memory (LSTM)
Important Elements of Machine
Learning
Loss Function(How could we say which model best fit for estimating the
correct relationship for the task ? )
Important Elements of Machine
Learning
 The loss function calculates the difference between the true output y and the
approximated value f̂(x). It is represented by L.
 If L = 0, then the estimate of our model is exactly accurate.
Different loss functions are :

1. Square Error Loss


2. Cross-Entropy Loss
3. KL divergence
4. Hinge Loss
5. Huber Loss etc.
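A minimal sketch with invented numbers of the first two losses in the list, square error and (binary) cross-entropy, computed for a single prediction.

```python
# Minimal sketch: square error and binary cross-entropy for one prediction.
import numpy as np

# Square error for a regression-style prediction.
y_true, y_pred = 3.0, 2.5
square_error = (y_true - y_pred) ** 2                     # 0.25

# Cross-entropy for a classification probability (true label = 1, predicted P = 0.8).
p_true, p_pred = 1.0, 0.8
cross_entropy = -(p_true * np.log(p_pred) + (1 - p_true) * np.log(1 - p_pred))

print(square_error, cross_entropy)
```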
Important Elements of Machine
Learning
Learning Algorithm(How do we estimate different parameters ?):
 Under this element, the success of machine learning lies.
 Parameter estimation in machine learning is a kind of search operation.
 We can compute the parameters through a learning algorithm and it
becomes an optimization problem where we try to optimize the
parameters by minimizing the loss.
 Hence, the learning algorithm and loss function go hand in hand.
Important Elements of Machine
Learning
Some of the popular learning algorithm/optimization solvers
are :
1. Gradient Descent
2. Adagrad
3. RMSProp
4. ADAM
5. Backpropagation
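A minimal sketch of the first optimizer in the list, plain gradient descent, minimizing the mean squared error of a one-parameter model y = w·x; the data values and learning rate are assumptions for illustration.

```python
# Minimal sketch: gradient descent on a one-parameter least-squares problem.
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])          # generated by the "true" relation y = 2x

w, lr = 0.0, 0.05                      # initial parameter and learning rate
for _ in range(200):
    grad = np.mean(2 * (w * x - y) * x)    # d/dw of the mean squared error
    w -= lr * grad                         # step against the gradient
print(w)                                   # approaches 2.0
```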
Important Elements of Machine
Learning
Evaluation (How do we compute the accuracy of an ML/DL model?):
Accuracy = (No. of correct predictions) / (Total no. of predictions)
 Calculating accuracy indicates how efficient the model is and is more
interpretable for the end-user than the loss function.
 Top-K Accuracy: if the correct output appears among the top k predictions
made by the model, the prediction is counted as correct.
Top-K Accuracy = (No. of correct predictions in the top k) / (Total no. of predictions)
Important Elements of Machine
Learning
 Models are evaluated on test data (data points that haven't been seen by the machine).
 The standard evaluation metrics in the case of object detection, where
some action is required to be taken, are precision and recall.
Precision = (No. of correct action) / (Total no. of action taken )
Recall = (Actual no. of correct action) / (Total No. of times correct
action to be taken)
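A minimal sketch with invented predictions that applies the accuracy, precision, and recall formulas above (treating a predicted 1 as an "action taken" and a true 1 as an action that should be taken).

```python
# Minimal sketch: accuracy, precision, and recall from raw predictions.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)                               # 6 / 8 = 0.75

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))    # correct "actions"
precision = tp / sum(p == 1 for p in y_pred)                   # out of actions taken
recall = tp / sum(t == 1 for t in y_true)                      # out of actions needed

print(accuracy, precision, recall)
```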
Important Elements of Machine
Learning
Data Formats
Data is a crucial component in the field of Machine Learning.
It refers to the set of observations or measurements that can be used to
train a machine-learning model.
The quality and quantity of data available for training and testing play a
significant role in determining the performance of a machine-learning
model.
Data can be in various forms such as numerical, categorical, or time-series
data, and can come from various sources such as databases, spreadsheets,
or APIs.
Data Formats
Machine learning algorithms use data to learn patterns and relationships
between input variables and target outputs, which can then be used for
prediction or classification tasks.
Data is typically divided into two types:
1. Labeled data
2. Unlabeled data
Data Formats
 Labeled data includes a label or target variable that the model is trying
to predict, whereas unlabeled data does not include a label or target
variable.
 The data used in machine learning is typically numerical or categorical.
 Numerical data includes values that can be ordered and measured,
such as age or income.
 Categorical data includes values that represent categories, such as
gender or type of fruit.
Data Formats
 Data can be divided into training and testing sets.
 The training set is used to train the model, and the testing set is used to
evaluate the performance of the model.
 It is important to ensure that the data is split in a random and
representative way.
 Data preprocessing is an important step in the machine learning pipeline.
 This step can include cleaning and normalizing the data, handling missing
values, and feature selection or engineering.
How do we split data in Machine
Learning?
Training Data:
 The part of data we use to train our model.
 This is the data that your model actually sees(both input and output) and learns
from.
Validation Data:
 The part of the data used for frequent evaluation of the model while it is fit on
the training dataset, and for tuning the hyperparameters (parameters set before
the model begins learning).
 This data plays its part while the model is actually training.
How do we split data in Machine
Learning?
Testing Data:
Once our model is completely trained, testing data provides an unbiased
evaluation.
 When we feed in the inputs of the testing data, our model predicts values
without seeing the actual outputs.
 After prediction, we evaluate our model by comparing its predictions with the
actual outputs present in the testing data.
 This is how we evaluate how much our model has learned from the experience
fed in as training data at the time of training.
How do we split data in Machine
Learning?
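A minimal sketch of one common way to make this split with scikit-learn; the 70/15/15 proportions and the synthetic dataset are assumptions, not ratios given in the slides.

```python
# Minimal sketch: splitting data into training, validation, and test sets.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# Hold out 30% of the data, then split that held-out part half-and-half.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.30, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.50, random_state=0)

print(len(X_train), len(X_val), len(X_test))   # 700 150 150
```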
Parametric Machine Learning
Algorithms
 A learning model that summarizes data with a set of parameters of fixed size (independent of the
number of training examples) is called a parametric model. No matter how much data you throw at
a parametric model, it won’t change its mind about how many parameters it needs.
The algorithms involve two steps:
 Select a form for the function.
 Learn the coefficients for the function from the training data.
 An easy-to-understand functional form for the mapping function is a line, as is used in linear
regression:
b0 + b1*x1 + b2*x2 = 0
 Where b0, b1 and b2 are the coefficients of the line that control the intercept and slope, and x1
and x2 are two input variables.
Parametric Machine Learning
Algorithms
It is assumed that the functional form of a line helps to simplify the learning process.
 What remains is to estimate the coefficients of the line equation, which yields a
predictive model for the problem.
Some more examples of parametric machine learning algorithms include:
1. Logistic Regression
2. Linear Discriminant Analysis
3. Perceptron
4. Naive Bayes
5. Simple Neural Networks
Parametric Machine Learning
Algorithms
Benefits of Parametric Machine Learning Algorithms:
 Simpler: These methods are easier to understand and interpret results.
 Speed: Parametric models are very fast to learn from data.
 Less Data: They do not require as much training data and can work well even if the fit to the
data is not perfect.
Limitations of Parametric Machine Learning Algorithms:
 Constrained: By choosing a functional form these methods are highly constrained to the
specified form.
 Limited Complexity: The methods are more suited to simpler problems.
 Poor Fit: In practice, the methods are unlikely to match the underlying mapping function.
Nonparametric Machine
Learning Algorithms
 Nonparametric methods are good when you have a lot of data and no prior knowledge,
and when you don’t want to worry too much about choosing just the right features.
 Nonparametric methods seek to best fit the training data in constructing the mapping
function, whilst maintaining some ability to generalize to unseen data.
 As such, they are able to fit a large number of functional forms.
 An easy to understand nonparametric model is the k-nearest neighbors algorithm that
makes predictions based on the k most similar training patterns for a new data instance.
 The method does not assume anything about the form of the mapping function other
than patterns that are close are likely to have a similar output variable.
Nonparametric Machine
Learning Algorithms
Some more examples of popular nonparametric machine learning
algorithms are:
1. k-Nearest Neighbors
2. Decision Trees like CART and C4.5
3. Support Vector Machines
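A minimal sketch of the k-nearest neighbors model described above; note that "training" essentially stores the data itself rather than a fixed set of parameters. The synthetic dataset and k = 5 are assumptions.

```python
# Minimal sketch: a nonparametric k-nearest neighbors classifier.
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=100, n_features=4, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)   # "training" just stores X and y
print(knn.predict(X[:3]))                             # predicts from the 5 closest stored points
```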
Nonparametric Machine
Learning Algorithms
Benefits of Nonparametric Machine Learning Algorithms:
 Flexibility: Capable of fitting a large number of functional forms.
 Power: No assumptions (or weak assumptions) about the underlying function.
 Performance: Can result in higher performance models for prediction.

Limitations of Nonparametric Machine Learning Algorithms:


 More data: Require a lot more training data to estimate the mapping function.
 Slower: A lot slower to train as they often have far more parameters to train.
 Overfitting: More of a risk to overfit the training data and it is harder to explain why
specific predictions are made.
Multiclass Classification
 Multiclass classification is a machine learning classification task that
consists of more than two classes, or outputs.
 For example, using a model to identify animal types in images from
an encyclopedia is a multiclass classification example because there are
many different animal classifications that each image can be classified
as.
 Multiclass classification also requires that a sample only have one class
(i.e. an elephant is only an elephant; it is not also a lemur).
Multiclass Classification
 Outside of regression, multiclass classification is probably the most common machine-
learning task.
 In classification, we are presented with the number of training examples divided into K
separate classes, and we build a machine learning model to predict which of those classes
some previously unseen data belongs to (i.e. the animal types from the previous example).
 In seeing the training dataset, the model learns patterns specific to each class and uses
those patterns to predict the membership of future data.
 For instance, images of cats may all follow a pattern of pointed ears and whiskers, helping
the model to identify future images of cats as compared to other animals without whiskers
or pointed ears.
Scaling and Normalization
The difference is that:
 in scaling, you're changing the range of your data, while
 in normalization, you're changing the shape of the distribution of your
data.
Scaling
 This means that you're transforming your data so that it fits within a
specific scale, like 0-100 or 0-1.
 You want to scale data when you're using methods based on measures
of how far apart data points are, like support vector machines (SVM) or k-
nearest neighbors (KNN).
 With these algorithms, a change of "1" in any numeric feature is given
the same importance.
Scaling
 For example, you might be looking at the prices of some products in both Yen and US
Dollars.
 One US Dollar is worth about 100 Yen, but if you don't scale your prices, methods like
SVM or KNN will consider a difference in price of 1 Yen as important as a difference of 1
US Dollar!
 This clearly doesn't fit with our intuitions of the world.
 With currency, you can convert between currencies. But what about if you're looking
at something like height and weight?
 It's not entirely clear how many pounds should equal one inch (or how many
kilograms should equal one meter).
Scaling
Notice that the shape of the data doesn't change, but instead of ranging from roughly 0 to 8, it now ranges from 0 to 1.
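A minimal sketch of min-max scaling applied to made-up Yen prices, mapping them into the 0-1 range so that a unit change means the same thing for every feature.

```python
# Minimal sketch: min-max scaling to the 0-1 range.
import numpy as np
from sklearn.preprocessing import MinMaxScaler

prices_yen = np.array([[100.0], [550.0], [1000.0], [10000.0]])   # invented prices

scaled = MinMaxScaler().fit_transform(prices_yen)   # (x - min) / (max - min)
print(scaled.ravel())                               # values now lie between 0 and 1
```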
Normalization
 Scaling just changes the range of your data.
 Normalization is a more radical transformation.
 The point of normalization is to change your observations so that they can be
described as a normal distribution.
 Also known as the "bell curve", this is a specific statistical distribution where
roughly equal numbers of observations fall above and below the mean, the mean and the
median are the same, and there are more observations closer to the mean.
 The normal distribution is also known as the Gaussian distribution.
Normalization
 The method we're using to normalize here is called the Box-Cox
Transformation.
 Let's take a quick peek at what normalizing some data looks like:
Normalization
Notice that the shape of our data has changed. Before normalizing, it was almost L-shaped; after normalizing, it looks more like the outline of a bell (hence "bell curve").
TYPES OF DATA
 The data collected from various sources through the use of different
tools and techniques generally consists of numerical figures, ratings,
narrations, responses to open-ended questions comprising a
questionnaire or an interview schedule, quotations, field notes, etc.
 In educational research, the use of two types of data is usually recognized.
 These are qualitative data and quantitative data.
TYPES OF DATA
 Quantitative and qualitative methods generate different types of data.
 Quantitative data is expressed as numbers; qualitative data is expressed as words.
 Quantitative and qualitative methods can be combined in many ways to
build on the strengths of both, and minimize their relative weaknesses.
 There is a growing consensus that both are important.
 This has led to an increased interest in mixed methods evaluations.
TYPES OF DATA
 Quantitative and qualitative methods generate different types of data.
 In general, quantitative methods result in quantitative data, whilst
qualitative methods produce qualitative data.
 Quantitative data is expressed in numbers (e.g. units, prices,
proportions, rates of change and ratios).
 Qualitative data is expressed as words (e.g. statements, paragraphs,
stories, case studies and quotations).
Qualitative Data
 Qualitative data provides information about the quality of an object or
information which cannot be measured.
 For example, if we consider the quality of performance of students in terms
of ‘Good’, ‘Average’, and ‘Poor’, it falls under the category of qualitative data.
 Also, name or roll number of students are information that cannot be
measured using some scale of measurement.
 So they would fall under qualitative data.
 Qualitative data is also called categorical data.
Qualitative Data
 Qualitative data can be further subdivided into two types as follows:

1. Nominal data
2. Ordinal data
Qualitative data
 Nominal data is one which has no numeric value, but a named value. It is used
for assigning named values to attributes. Nominal values cannot be quantified.
 Examples of nominal data are

1. Blood group: A, B, O, AB, etc.


2. Nationality: Indian, American, British, etc.
3. Gender: Male, Female, Other
 Note:- A special case of nominal data is when only two labels are possible, e.g.
pass/fail as a result of an examination.
Qualitative data
 Obviously, mathematical operations such as addition, subtraction,
multiplication, etc. cannot be performed on nominal data.
 For that reason, statistical functions such as mean, variance, etc. can
also not be applied on nominal data.
 However, a basic count is possible. So mode, i.e. most frequently
occurring value, can be identified for nominal data.
Qualitative data
 Ordinal data, in addition to possessing the properties of nominal data, can also be
naturally ordered.
 This means ordinal data also assigns named values to attributes but unlike nominal
data, they can be arranged in a sequence of increasing or decreasing value so that
we can say whether a value is better than or greater than another value.
 Examples of ordinal data are

1. Customer satisfaction: ‘Very Happy’, ‘Happy’, ‘Unhappy’, etc.


2. Grades: A, B, C, etc.
3. Hardness of Metal: ‘Very Hard’, ‘Hard’, ‘Soft’, etc.
Qualitative data
 Like nominal data, basic counting is possible for ordinal data.
 Hence, the mode can be identified.
 Since ordering is possible in case of ordinal data, median, and quartiles
can be identified in addition. Mean can still not be calculated.
Quantitative data
 Quantitative data relates to information about the quantity of an object –
hence it can be measured.
 For example, if we consider the attribute ‘marks’, it can be measured
using a scale of measurement.
 Quantitative data is also termed as numeric data.
 There are two types of quantitative data:

1. Interval data
2. Ratio data
Quantitative data
 Interval data is numeric data for which not only the order is known, but
the exact difference between values is also known.
 An ideal example of interval data is Celsius temperature.
 The difference between each value remains the same in Celsius
temperature.
 For example, the difference between 12°C and 18°C degrees is measurable
and is 6°C as in the case of difference between 15.5°C and 21.5°C.
 Other examples include date, time, etc.
Quantitative data
 For interval data, mathematical operations such as addition and
subtraction are possible.
 For that reason, for interval data, the central tendency can be
measured by mean, median, or mode.
 Standard deviation can also be calculated.
Quantitative data
 However, interval data do not have something called a ‘true zero’ value.
 For example, there is nothing called ‘0 temperature’ or ‘no temperature’.
Hence, only addition and subtraction applies for interval data.
 The ratio cannot be applied.
 This means, we can say a temperature of 40°C is equal to the
temperature of 20°C + temperature of 20°C.
 However, we cannot say the temperature of 40°C means it is twice as
hot as in temperature of 20°C.
Quantitative data
 Ratio data represents numeric data for which exact value can be
measured. Absolute zero is available for ratio data.
 Also, these variables can be added, subtracted, multiplied, or divided.
 The central tendency can be measured by mean, median, or mode and
methods of dispersion such as standard deviation.
 Examples of ratio data include height, weight, age, salary, etc.
Quantitative data
 Figure gives a summarized view of different types of data that we may
find in a typical machine learning problem.
Feature Selection
 Feature selection is the process of choosing the most important and relevant features
from your data that contribute the most to predicting the target variable.
 In simple terms, it's like picking out the most useful pieces of information from a large
set to make a better and more efficient model.
 Imagine you are trying to determine if an apple leaf is healthy or diseased based on
several features such as color, size, texture, and shape.
 If some features don't really help in making this determination (like the size might not be
as indicative as the color or texture), you can ignore them.
 By focusing only on the most useful features, you can build a model that is simpler,
faster, and often more accurate.
Feature Selection
Feature Selection
 The role of feature selection in machine learning is,

1. To reduce the model complexity and make it easier to interpret.

2. To speed up a learning algorithm.

3. To improve the predictive accuracy of a classification algorithm.

4. To improve the comprehensibility of the learning results.


Feature Selection Algorithms
are as follows:
 While building a machine learning model for a real-life dataset, we come across a lot
of features in the dataset, and not all of these features are important every time.
 Adding unnecessary features while training the model reduces the overall accuracy
of the model, increases its complexity, decreases the generalization capability of the
model, and makes the model biased.
 Even the saying "Sometimes less is better" applies to machine learning models.
Hence, feature selection is one of the important steps while building a machine
learning model.
 Its goal is to find the best possible set of features for building a machine learning
model.
Feature Selection Algorithms
are as follows:
 There are three general classes of feature selection algorithms:

- Filter methods
- Wrapper methods
- Embedded methods
Filter Methods
 These methods are generally used during the pre-processing step.
 These methods select features from the dataset irrespective of the use of
any machine learning algorithm.
 In terms of computation, they are very fast and inexpensive and are very
good for removing duplicated, correlated, and redundant features, but these
methods do not remove multicollinearity.
 Each feature is evaluated individually, which can help when features are
informative in isolation but falls short when a combination of features would
increase the overall performance of the model.
Filter Methods

Filter Methods
Implementation
Some techniques used are:
 Information Gain – It is defined as the amount of information provided by the
feature for identifying the target value and measures reduction in the entropy
values. Information gain of each attribute is calculated considering the target
values for feature selection.
 Chi-square test — Chi-square method (X2) is generally used to test the
relationship between categorical variables. It compares the observed values from
different attributes of the dataset to its expected value.
Some techniques used are:
 Fisher’s Score – Fisher’s Score selects each feature independently according
to their scores under Fisher criterion leading to a suboptimal set of features. The
larger the Fisher’s score is, the better is the selected feature.
 Correlation Coefficient – Pearson’s Correlation Coefficient is a measure of
quantifying the association between the two continuous variables and the
direction of the relationship with its values ranging from -1 to 1.
 Variance Threshold – It is an approach where all features are removed
whose variance doesn’t meet the specific threshold. By default, this method
removes features having zero variance. The assumption made using this
method is higher variance features are likely to contain more information.
Some techniques used are:
 Mean Absolute Difference (MAD) – This method is similar to variance threshold
method but the difference is there is no square in MAD. This method calculates the mean
absolute difference from the mean value.
 Dispersion Ratio – Dispersion ratio is defined as the ratio of the Arithmetic mean (AM)
to that of Geometric mean (GM) for a given feature. Its value ranges from +1 to ∞ as AM
≥ GM for a given feature. Higher dispersion ratio implies a more relevant feature.
 Mutual Dependence – This method measures if two variables are mutually dependent,
and thus provides the amount of information obtained for one variable on observing the
other variable. Depending on the presence/absence of a feature, it measures the amount
of information that feature contributes to making the target prediction.
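A minimal sketch of two of the filter techniques above, a variance threshold and a chi-square ranking, run on the small Iris dataset from scikit-learn; the threshold of 0.2 and k = 2 are arbitrary assumptions.

```python
# Minimal sketch: filter-style feature selection on the Iris dataset.
from sklearn.datasets import load_iris
from sklearn.feature_selection import VarianceThreshold, SelectKBest, chi2

X, y = load_iris(return_X_y=True)

X_var = VarianceThreshold(threshold=0.2).fit_transform(X)   # drop low-variance features
X_chi = SelectKBest(chi2, k=2).fit_transform(X, y)          # keep the 2 best chi-square scores

print(X.shape, X_var.shape, X_chi.shape)
```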
Wrapper methods:
 Wrapper methods, also referred to as greedy algorithms, train the algorithm using
a subset of features in an iterative manner.
 Based on the conclusions drawn from training the model so far, features are added
or removed.
 Stopping criteria for selecting the best subset are usually pre-defined by the
person training the model, such as when the performance of the model decreases
or a specific number of features has been reached.
 The main advantage of wrapper methods over filter methods is that they
provide an optimal set of features for training the model, thus resulting in better
accuracy than filter methods, but they are computationally more expensive.
Wrapper methods:

Wrapper Methods
Implementation
Some techniques used are:
 Forward selection – This method is an iterative approach where we initially
start with an empty set of features and keep adding the feature which best
improves our model after each iteration. We stop when the addition of a new
variable no longer improves the performance of the model.
 Backward elimination – This method is also an iterative approach where we
initially start with all features and after each iteration remove the least
significant feature. We stop when no improvement in the performance of the
model is observed after a feature is removed.
 Bi-directional elimination – This method uses both the forward selection and
backward elimination techniques simultaneously to reach one unique solution.
Some techniques used are:
 Exhaustive selection – This technique is considered the brute-force
approach to evaluating feature subsets. It creates all possible subsets, builds
a learning algorithm for each subset, and selects the subset whose model
performs best.
 Recursive elimination – This greedy optimization method selects features
by recursively considering smaller and smaller sets of features. The estimator
is trained on an initial set of features and their importance is obtained from a
feature importance attribute (such as coef_ or feature_importances_). The
least important features are then removed from the current set of features
until we are left with the required number of features.
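A minimal sketch of recursive feature elimination using scikit-learn's RFE with a logistic regression estimator; the breast-cancer dataset and the choice of 5 features are assumptions for illustration.

```python
# Minimal sketch: wrapper-style selection with recursive feature elimination (RFE).
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

selector = RFE(LogisticRegression(max_iter=5000), n_features_to_select=5)
selector.fit(X, y)               # repeatedly drops the least important feature
print(selector.support_)         # True for the 5 features that were kept
```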
Embedded methods:
 In embedded methods, the feature selection algorithm is blended into the
learning algorithm itself, which thus has its own built-in feature selection.
 Embedded methods address the drawbacks of filter and wrapper methods
and merge their advantages.
 These methods are fast like filter methods, more accurate than filter
methods, and take combinations of features into consideration as well.
Embedded methods:

Embedded Methods
Implementation
Some techniques used are:
 Regularization – This method adds a penalty to different parameters of
the machine learning model to avoid over-fitting of the model. This
approach of feature selection uses Lasso (L1 regularization) and Elastic
nets (L1 and L2 regularization). The penalty is applied over the coefficients,
thus bringing down some coefficients to zero. The features having zero
coefficient can be removed from the dataset.
 Tree-based methods – These methods such as Random Forest, Gradient
Boosting provides us feature importance as a way to select features as
well. Feature importance tells us which features are more important in
making an impact on the target feature.
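A minimal sketch of the embedded approach: L1 regularization (Lasso) drives some coefficients to exactly zero during training, and those features can then be dropped. The diabetes dataset and alpha = 1.0 are assumptions for illustration.

```python
# Minimal sketch: embedded feature selection via Lasso (L1) regularization.
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso

X, y = load_diabetes(return_X_y=True)

lasso = Lasso(alpha=1.0).fit(X, y)
print(lasso.coef_)                                          # several coefficients are exactly 0
selected = [i for i, c in enumerate(lasso.coef_) if c != 0]
print("features kept:", selected)
```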
Curse of Dimensionality in Machine Learning
 As the number of features or dimensions in a dataset increases, the amount of data required
to obtain a statistically significant result increases exponentially.
 This can lead to issues such as overfitting, increased computation time, and reduced
accuracy of machine learning models this is known as the curse of dimensionality
problems that arise while working with high-dimensional data.
 As the number of dimensions increases, the number of possible combinations of features
increases exponentially, which makes it computationally difficult to obtain a representative
sample of the data, and it becomes expensive to perform tasks such as clustering or
classification.
 Additionally, some machine learning algorithms can be sensitive to the number of dimensions,
requiring more data to achieve the same level of accuracy as lower-dimensional data.
Principal Component Analysis
(PCA)
 To address the curse of dimensionality, Feature engineering techniques
are used which include feature selection and feature extraction.
 Dimensionality reduction is a type of feature extraction technique that
aims to reduce the number of input features while retaining as much of
the original information as possible.
What is Principal Component
Analysis(PCA)?
 Principal Component Analysis(PCA) technique was introduced by the
mathematician Karl Pearson in 1901.
 It works on the condition that while the data in a higher dimensional
space is mapped to data in a lower dimension space, the variance of the
data in the lower dimensional space should be maximum.
What is Principal Component
Analysis(PCA)?
 Principal Component Analysis (PCA) is a statistical procedure that uses an orthogonal
transformation that converts a set of correlated variables to a set of uncorrelated
variables.
 PCA is the most widely used tool in exploratory data analysis and in machine learning for
predictive models.
 Principal Component Analysis (PCA) is an unsupervised learning algorithm technique used
to examine the interrelations among a set of variables.
 It is also known as a general factor analysis where regression determines a line of best fit.
 The main goal of Principal Component Analysis (PCA) is to reduce the dimensionality of a
dataset while preserving the most important patterns or relationships between the
variables without any prior knowledge of the target variables.
What is Principal Component
Analysis(PCA)?
 Principal Component Analysis (PCA) is used to reduce the
dimensionality of a data set by finding a new set of variables, smaller
than the original set of variables, retaining most of the sample’s
information, and useful for the regression and classification of data.
What is Principal Component
Analysis(PCA)?
 Principal Component Analysis (PCA) is a technique for dimensionality reduction
that identifies a set of orthogonal axes, called principal components, that capture
the maximum variance in the data.
 The principal components are linear combinations of the original variables in the
dataset and are ordered in decreasing order of importance.
 The total variance captured by all the principal components is equal to the total
variance in the original dataset.
 The first principal component captures the most variation in the data, but the
second principal component captures the maximum variance that is orthogonal
to the first principal component, and so on.
What is Principal Component
Analysis(PCA)?
 Principal Component Analysis can be used for a variety of purposes, including data
visualization, feature selection, and data compression.
 In data visualization, PCA can be used to plot high-dimensional data in two or three
dimensions, making it easier to interpret.
 In feature selection, PCA can be used to identify the most important variables in a
dataset.
 In data compression, PCA can be used to reduce the size of a dataset without losing
important information.
 In Principal Component Analysis, it is assumed that the information is carried in the
variance of the features, that is, the higher the variation in a feature, the more information
that features carries.
Step-By-Step Explanation of PCA
(Principal Component Analysis)
Step 1: Standardization

First, we need to standardize the dataset so that each variable has a mean of 0
and a standard deviation of 1:

Z = (X − μ) / σ

Here,
• μ is the mean of the independent features, μ = {μ1, μ2, …, μm}
• σ is the standard deviation of the independent features, σ = {σ1, σ2, …, σm}


Step-By-Step Explanation of PCA
(Principal Component Analysis)
Step2: Covariance Matrix Computation

Covariance measures the strength of joint variability between two or more variables,
indicating how much they change in relation to each other. For two features x1 and x2
with n samples, the sample covariance is

cov(x1, x2) = (1 / (n − 1)) Σ (x1_i − mean(x1)) (x2_i − mean(x2))

The value of the covariance can be positive, negative, or zero.

• Positive: as x1 increases, x2 also increases.

• Negative: as x1 increases, x2 decreases.

• Zero: no direct relation.


Step-By-Step Explanation of PCA
(Principal Component Analysis)
Step 3: Compute Eigenvalues and Eigenvectors of Covariance Matrix to Identify Principal
Components

Let A be a square n×n matrix and X a non-zero vector for which

AX = λX

for some scalar value λ. Then λ is known as an eigenvalue of the matrix A and X is known as an
eigenvector of A for the corresponding eigenvalue.

It can also be written as:

AX − λX = 0

(A − λI)X = 0


Step-By-Step Explanation of PCA
(Principal Component Analysis)
 Here, I is the identity matrix of the same shape as matrix A. The above condition holds only if (A − λI) is non-invertible (i.e., a singular matrix). That means
|A − λI| = 0
 From this equation we can find the eigenvalues λ, and the corresponding eigenvector for each can then be found using the equation
AX = λX
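A minimal NumPy sketch of this final step, again repeating the tiny hypothetical example so it is self-contained; keeping k = 1 component is an illustrative choice.

import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])
Z = (X - X.mean(axis=0)) / X.std(axis=0)        # Step 1: standardization
cov_matrix = np.cov(Z, rowvar=False)            # Step 2: covariance matrix

# Step 3: eigen-decomposition (np.linalg.eigh suits symmetric matrices)
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)

# Sort eigenpairs by decreasing eigenvalue (decreasing explained variance)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Keep the top-k eigenvectors as principal components and project the data onto them
k = 1
components = eigenvectors[:, :k]                # shape (m, k)
Z_reduced = Z @ components                      # shape (n, k): the dimensionally reduced data

print("explained variance ratio:", eigenvalues[:k] / eigenvalues.sum())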
Advantages of Principal
Component Analysis
 Dimensionality Reduction: Principal Component Analysis is a popular technique
used for dimensionality reduction, which is the process of reducing the number of
variables in a dataset. By reducing the number of variables, PCA simplifies data
analysis, improves performance, and makes it easier to visualize data.
 Feature Selection: Principal Component Analysis can be used for feature selection,
which is the process of selecting the most important variables in a dataset. This is
useful in machine learning, where the number of variables can be very large, and it is
difficult to identify the most important variables.
 Data Visualization: Principal Component Analysis can be used for data visualization.
By reducing the number of variables, PCA can plot high-dimensional data in two or three
dimensions, making it easier to interpret.
Advantages of Principal
Component Analysis
 Multicollinearity: Principal Component Analysis can be used to deal with multicollinearity, which is a common
problem in a regression analysis where two or more independent variables are highly correlated. PCA can help
identify the underlying structure in the data and create new, uncorrelated variables that can be used in the
regression model.
 Noise Reduction: Principal Component Analysis can be used to reduce the noise in data. By removing the
principal components with low variance, which are assumed to represent noise, Principal Component Analysis can
improve the signal-to-noise ratio and make it easier to identify the underlying structure in the data.
 Data Compression: Principal Component Analysis can be used for data compression. By representing the data
using a smaller number of principal components, which capture most of the variation in the data, PCA can reduce
the storage requirements and speed up processing.
 Outlier Detection: Principal Component Analysis can be used for outlier detection. Outliers are data points
that are significantly different from the other data points in the dataset. Principal Component Analysis can
identify these outliers by looking for data points that are far from the other points in the principal component
space.
Disadvantages of Principal
Component Analysis
 Interpretation of Principal Components: The principal components created by Principal
Component Analysis are linear combinations of the original variables, and it is often difficult
to interpret them in terms of the original variables. This can make it difficult to explain the
results of PCA to others.
 Data Scaling: Principal Component Analysis is sensitive to the scale of the data. If the data
is not properly scaled, then PCA may not work well. Therefore, it is important to scale the
data before applying Principal Component Analysis.
 Information Loss: Principal Component Analysis can result in information loss. While
Principal Component Analysis reduces the number of variables, it can also lead to loss of
information. The degree of information loss depends on the number of principal components
selected. Therefore, it is important to carefully select the number of principal components to
retain.
Disadvantages of Principal
Component Analysis
 Non-linear Relationships: Principal Component Analysis assumes that the
relationships between variables are linear. However, if there are non-linear
relationships between variables, Principal Component Analysis may not work well.
 Computational Complexity: Computing Principal Component Analysis can be
computationally expensive for large datasets. This is especially true if the number
of variables in the dataset is large.
 Overfitting: Principal Component Analysis can sometimes result in overfitting,
which is when the model fits the training data too well and performs poorly on
new data. This can happen if too many principal components are used or if the
model is trained on a small dataset.
Dataset Validation Techniques
Cross-Validation in Machine Learning
 Cross-validation is a technique for validating model efficiency by training the model on a subset of the input data and testing it on a previously unseen subset of the input data.
 In machine learning, there is always a need to test the stability of the model.
 This means we cannot judge the model using only the training dataset; fitting and evaluating on the same training data does not tell us how the model behaves on unseen data.
 For this purpose, we reserve a particular sample of the dataset, which was not part of the
training dataset.
 After that, we test our model on that sample before deployment, and this complete process
comes under cross-validation.
 This is something different from the general train-test split.
Dataset Validation Techniques
Hence the basic steps of cross-validation are:
 Reserve a subset of the dataset as a validation set.
 Provide the training to the model using the training dataset.
 Now, evaluate the model's performance using the validation set. If the model performs well on the validation set, proceed to the next step; otherwise, check for issues.
Methods used for Cross-
Validation
Some common methods are used for cross-validation.
These methods are given below:
1. Validation Set Approach
2. Leave-P-out cross-validation
3. Leave one out cross-validation
4. K-fold cross-validation
5. Stratified k-fold cross-validation
Validation Set Approach
 In the validation set approach, we divide our input dataset into a training set and a test or validation set.
 Both subsets are given 50% of the dataset.
 One big disadvantage is that we use only 50% of the dataset to train our model, so the model may fail to capture important information in the data.
 It also tends to give an underfitted model (a short sketch of this split follows below).
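A minimal scikit-learn sketch of the validation set approach; the Iris dataset and logistic regression model are assumed purely for illustration.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# 50/50 split: only half of the data is available for training
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.5, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Validation accuracy:", model.score(X_val, y_val))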
Leave-P-out cross-validation
 In this approach, p data points are left out of the training data.
 It means that if there are a total of n data points in the original input dataset, then n − p data points are used as the training set and the p data points as the validation set.
 This complete process is repeated for all possible combinations of p points, and the average error is calculated to know the effectiveness of the model.
 A disadvantage of this technique is that it can be computationally expensive for large p.
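A minimal sketch using scikit-learn's LeavePOut splitter; the tiny 5-sample array is hypothetical, and it shows how quickly the number of splits grows.

import numpy as np
from sklearn.model_selection import LeavePOut

X = np.arange(10).reshape(5, 2)       # 5 hypothetical samples with 2 features each
lpo = LeavePOut(p=2)                  # leave 2 points out in every split

print("number of splits:", lpo.get_n_splits(X))   # C(5, 2) = 10 train/validation splits
for train_idx, val_idx in lpo.split(X):
    print("train:", train_idx, "validation:", val_idx)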
Leave one out cross-validation
 This method is similar to leave-p-out cross-validation, but instead of p, we take only 1 data point out of the training data.
 It means that, for each learning set, only one data point is reserved for validation, and the remaining data points are used to train the model.
 This process repeats for each data point.
 Hence for n samples we get n different training sets and n test sets. It has the following features:
- In this approach, the bias is minimum, as nearly all the data points are used for training.
- The process is executed n times; hence the execution time is high.
- This approach leads to high variation in testing the effectiveness of the model, because we iteratively test against a single data point.
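A minimal scikit-learn sketch of leave one out cross-validation; Iris and logistic regression are again assumed only for illustration.

from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# One model fit per sample: 150 fits for the 150 Iris samples
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=LeaveOneOut())
print("number of fits:", len(scores))
print("mean accuracy:", scores.mean())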
K-Fold Cross-Validation
 K-fold cross-validation approach divides the input dataset into K groups of
samples of equal sizes.
 These samples are called folds.
 For each learning set, the prediction function is trained on k − 1 folds, and the remaining fold is used as the test set.
 This approach is a very popular CV approach because it is easy to understand,
and the output is less biased than other methods.
K-Fold Cross-Validation
The steps for k-fold cross-validation are:
• Split the input dataset into K groups
• For each group:
• Take one group as the reserve or test data set.
• Use remaining groups as the training dataset
• Fit the model on the training set and evaluate the performance of the model using
the test set.
K-Fold Cross-Validation
 Let's take an example of 5-fold cross-validation, so the dataset is grouped into 5 folds. In the 1st iteration, the first fold is reserved to test the model and the rest are used to train it. In the 2nd iteration, the second fold is used to test the model and the rest are used to train it. This process continues until each fold has been used as the test fold (the same 5-fold setup is sketched below).
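A minimal scikit-learn sketch of 5-fold cross-validation, where cross_val_score handles the loop over folds; dataset and model are illustrative assumptions.

from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

# The model is fitted 5 times; each time a different fold serves as the test set
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kfold)
print("fold accuracies:", scores)
print("mean accuracy:", scores.mean())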
Stratified k-fold cross-validation
 This technique is similar to k-fold cross-validation with a few small changes.
 This approach is based on the concept of stratification: the data is rearranged so that each fold or group is a good representative of the complete dataset.
 It is one of the best approaches for dealing with bias and variance.
 It can be understood with the example of housing prices, where the price of some houses can be much higher than that of other houses.
 To tackle such situations, a stratified k-fold cross-validation technique is useful.
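A minimal sketch of stratified k-fold splitting with scikit-learn's StratifiedKFold; it keeps the class proportions of the (assumed) Iris labels roughly equal in every fold.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold

X, y = load_iris(return_X_y=True)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for train_idx, val_idx in skf.split(X, y):
    # Each validation fold preserves the class distribution of the full dataset
    print("validation fold class counts:", np.bincount(y[val_idx]))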
Holdout Method
 This method is the simplest cross-validation technique of all.
 In this method, we hold out a subset of the dataset and train the model on the remaining part; the held-out subset is then used to obtain prediction results.
 The error on the held-out subset tells us how well our model is likely to perform on unseen data.
 Although this approach is simple to perform, it still suffers from high variance, and it can sometimes produce misleading results, as the sketch below illustrates.
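A minimal sketch of the holdout method that also hints at its high variance: repeating the random split with different seeds can give noticeably different accuracy estimates. Dataset, model, and split size are illustrative assumptions.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

for seed in range(3):   # repeat the holdout split with different random seeds
    X_tr, X_ho, y_tr, y_ho = train_test_split(X, y, test_size=0.3, random_state=seed)
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    print(f"split {seed}: holdout accuracy = {model.score(X_ho, y_ho):.3f}")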