Machine Learning - UNIT I Notes
INTRODUCTION
Machine Learning
Machine learning (ML) is a branch of artificial intelligence (AI) and computer science that
focuses on using data and algorithms to enable systems to imitate the way that humans learn, gradually
improving their accuracy.
In other words, machine learning is a subarea of artificial intelligence and is the study of algorithms that
give computers the ability to learn and make decisions based on data rather than explicit
instructions. A popular example is learning to predict whether an email is spam or not spam from many
different emails of these two types.
Machine learning is broadly categorized into several types based on the nature of the learning process
and the type of data used. The primary types include:
SUPERVISED LEARNING
Supervised learning is the type of machine learning in which machines are trained using well
"labelled" training data, and on the basis of that data, machines predict the output. Labelled data
means that some input data is already tagged with the correct output.
In supervised learning, the training data provided to the machine works as a supervisor that
teaches the machine to predict the output correctly. It applies the same concept as a student learning
under the supervision of a teacher.
Supervised learning is a process of providing input data as well as correct output data to the
machine learning model. The aim of a supervised learning algorithm is to find a mapping function to
map the input variable(x) with the output variable(y).
In the real-world, supervised learning can be used for Risk Assessment, Image classification,
Fraud Detection, spam filtering, etc.
In supervised learning, models are trained using a labelled dataset, where the model learns
about each type of data. Once the training process is completed, the model is tested on the basis of
test data and then predicts the output.
The working of Supervised learning can be easily understood by the below example and diagram:
Suppose we have a dataset of different types of shapes which includes square, rectangle,
triangle, and Polygon. Now the first step is that we need to train the model for each shape.
If the given shape has four sides, and all the sides are equal, then it will be labelled as a Square.
If the given shape has three sides, then it will be labelled as a triangle.
If the given shape has six equal sides then it will be labelled as hexagon.
Now, after training, we test our model using the test set, and the task of the model is to identify
the shape. The machine has already been trained on all types of shapes, and when it encounters a new
shape, it classifies the shape on the basis of the number of sides and predicts the output.
Key Concepts
Training Data: A dataset containing input-output pairs, where each input (feature set) is associated
with an output (label).
Features: The attributes or properties of the data used as input for the model.
Labels: The output values or categories that the model aims to predict.
Loss Function: A mathematical function that measures the difference between the predicted output
and the actual output. The goal is to minimize this difference.
Optimization Algorithm: A method used to minimize the loss function by adjusting the model's
parameters (e.g., weights in a neural network).
Classification: The task of predicting a discrete label or category for an input. The output is typically a
finite set of classes.
Binary Classification: Predicts one of two possible classes (e.g., spam or not spam).
Multiclass Classification: Predicts one of more than two classes (e.g., classifying types of
animals in images).
Regression: The task of predicting a continuous value for an input. The output is a real number, and
the goal is to predict quantities. Examples: Predicting house prices, estimating temperatures,
forecasting stock prices.
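As a small illustration of the loss function and regression concepts above, the following hedged Python sketch (using NumPy, with made-up example values that are not from these notes) computes the mean squared error between a model's predictions and the true outputs:

import numpy as np

# Hypothetical true values and model predictions for a regression task
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.8, 5.4, 2.9, 6.1])

# Mean squared error: the average squared difference between predictions
# and actual outputs; training aims to minimize this value
mse = np.mean((y_true - y_pred) ** 2)
print("MSE:", mse)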
Common Algorithms
Linear Regression: Models the relationship between input features and a continuous output using a
linear function.
Logistic Regression: Used for binary classification problems; it models the probability of an instance
belonging to a particular class.
Decision Trees: A model that splits the data into subsets based on feature values, forming a tree
structure.
Random Forests: An ensemble method that combines multiple decision trees to improve accuracy
and reduce overfitting.
Support Vector Machines (SVM): Finds the optimal hyperplane that separates classes in the feature
space.
K-Nearest Neighbors (KNN): Classifies a data point based on the majority class among its k nearest
neighbors.
Naive Bayes: A probabilistic classifier based on Bayes' theorem, assuming independence between
features.
Neural Networks: Computational models inspired by the human brain, used for both classification and
regression tasks, especially effective for complex problems.
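As a brief, hedged sketch of how one of the algorithms listed above might be used in practice (assuming the scikit-learn library and a synthetic dataset, both illustrative choices), the following Python code trains a decision tree classifier on labelled data and predicts the class of new inputs:

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic labelled dataset: 200 samples, 4 features, 2 classes
X, y = make_classification(n_samples=200, n_features=4, random_state=42)

# Train a decision tree (one of the common supervised algorithms above)
model = DecisionTreeClassifier(max_depth=3, random_state=42)
model.fit(X, y)

# Predict labels for the first five samples
print(model.predict(X[:5]))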
Workflow of Supervised Learning
Data Preprocessing: Clean and prepare the data (e.g., handling missing values, normalizing features).
Model Selection: Choose an appropriate algorithm based on the nature of the task (classification or
regression).
Training: Train the model using the training data by optimizing the loss function.
Validation: Evaluate the model's performance using a separate validation dataset to fine-tune
hyperparameters and prevent overfitting.
Testing: Assess the final model's performance on a test dataset to evaluate its generalization ability.
Deployment: Deploy the trained model to make predictions on new, unseen data.
Applications
Healthcare: Diagnosing diseases, predicting patient outcomes.
Challenges
Overfitting: When the model learns not only the underlying patterns but also the noise in the training
data, leading to poor performance on new data.
Underfitting: When the model is too simple to capture the underlying patterns in the data, resulting
in poor performance.
Data Quality: The quality and quantity of training data significantly affect the model's performance.
Model Interpretability: Some models, particularly complex ones like deep neural networks, can be
difficult to interpret.
UNSUPERVISED LEARNING
As the name suggests, unsupervised learning is a machine learning technique in which models
are not supervised using a training dataset. Instead, the models themselves find the hidden patterns
and insights from the given data. It can be compared to the learning that takes place in the human
brain while learning new things. It can be defined as:
Unsupervised learning is a type of machine learning where the model is trained on data that
is not labeled. Unlike supervised learning, there is no explicit output associated with each input in the
training dataset. The goal of unsupervised learning is to discover hidden patterns, relationships, or
structures within the data.
Example: Suppose the unsupervised learning algorithm is given an input dataset containing
images of different types of cats and dogs. The algorithm is never trained upon the given dataset,
which means it does not have any idea about the features of the dataset. The task of the unsupervised
learning algorithm is to identify the image features on its own. The unsupervised learning algorithm will
perform this task by clustering the image dataset into groups according to the similarities between
images.
Below are some main reasons which describe the importance of Unsupervised Learning:
Unsupervised learning is helpful for finding useful insights from the data.
Unsupervised learning is similar to how a human learns to think through their own experiences,
which makes it closer to true AI.
Unsupervised learning works on unlabeled and uncategorized data, which makes unsupervised
learning all the more important.
In real-world, we do not always have input data with the corresponding output so to solve
such cases, we need unsupervised learning.
Once a suitable algorithm is applied, the algorithm divides the data objects into groups
according to the similarities and differences between the objects.
The unsupervised learning algorithm can be further categorized into two types of problems:
Clustering: Clustering is a method of grouping objects into clusters such that objects with the most
similarities remain in a group and have few or no similarities with the objects of another group.
Cluster analysis finds the commonalities between the data objects and categorizes them as per the
presence and absence of those commonalities.
Association: An association rule is an unsupervised learning method which is used for finding
relationships between variables in a large database. It determines the sets of items that occur
together in the dataset. Association rules make marketing strategies more effective; for example,
people who buy item X (say, bread) also tend to purchase item Y (butter or jam). A typical example of
association rule learning is Market Basket Analysis.
Below is a list of some popular unsupervised learning algorithms:
K-means clustering
Hierarchical clustering
Anomaly detection
Neural Networks
Principal Component Analysis
Independent Component Analysis
Apriori algorithm
Singular value decomposition
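The following hedged Python sketch (assuming scikit-learn and synthetic, unlabeled 2-D data, both illustrative choices) shows K-means clustering, the first algorithm in the list above: the points are grouped into clusters without any labels being provided.

import numpy as np
from sklearn.cluster import KMeans

# Unlabeled data: two loose groups of 2-D points
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, size=(50, 2)),
               rng.normal(5, 1, size=(50, 2))])

# K-means groups the points into 2 clusters based only on similarity
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)
print("Cluster centres:\n", kmeans.cluster_centers_)
print("First ten cluster assignments:", labels[:10])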
Key Concepts
Unlabeled Data: The dataset used for training contains only input features without
corresponding output labels.
Clustering: Grouping similar data points together based on their features.
Association: Discovering interesting relationships between variables in large datasets.
Dimensionality Reduction: Reducing the number of input variables while retaining the
essential information.
Unsupervised learning is used for more complex tasks as compared to supervised learning
because, in unsupervised learning, we don't have labeled input data.
Unsupervised learning is preferable as it is easy to get unlabeled data in comparison to labeled
data.
Unsupervised learning is intrinsically more difficult than supervised learning as it does not
have corresponding output.
The result of the unsupervised learning algorithm might be less accurate as input data is not
labeled, and algorithms do not know the exact output in advance.
Summary of Learning Types
Supervised Learning
Definition: In supervised learning, the algorithm is trained on labeled data, where each input is paired
with the correct output.
Examples: Classification (e.g., spam detection in emails), regression (e.g., predicting house prices).
Common Algorithms: Linear regression, decision trees, support vector machines, neural networks.
Unsupervised Learning
Definition: In unsupervised learning, the algorithm is trained on unlabeled data and must find patterns
or structures within the data on its own.
Examples: Clustering (e.g., customer segmentation), association (e.g., market basket analysis).
Common Algorithms: K-means clustering, hierarchical clustering, principal component analysis (PCA).
Reinforcement Learning
Definition: A type of learning where an agent interacts with an environment and learns to make
decisions by receiving rewards or penalties.
Examples: Game playing (e.g., chess, Go), robotics (e.g., teaching robots to walk).
Testing: Evaluating the performance of the trained model on a separate dataset that it hasn’t seen
during training. This helps assess how well the model generalizes to new data.
Underfitting: When a model is too simple to capture the underlying patterns in the data, resulting in
poor performance even on the training data.
Feature Engineering
Definition: The process of selecting, modifying, and creating features (input variables) that will be used
by the machine learning model.
Importance: Good feature engineering can significantly improve the performance of a model.
Precision and Recall: Precision is the number of true positive results divided by the number of all
positive results, while recall is the number of true positive results divided by the number of positives
that should have been retrieved.
F1 Score: The harmonic mean of precision and recall, providing a balance between the two.
Mean Squared Error (MSE): Used in regression tasks, it measures the average squared difference
between the predicted and actual values.
Cross-Validation
Definition: A technique for evaluating the performance of a model by dividing the data into multiple
subsets and training/testing the model multiple times with different subsets.
Purpose: Helps ensure that the model’s performance is not dependent on a particular division of the
data.
Bias-Variance Tradeoff
Bias: Error due to overly simplistic models that do not capture the underlying patterns in the data.
Variance: Error due to models that are too complex and sensitive to small fluctuations in the training
data.
Tradeoff: A balance must be struck between bias and variance to achieve the best generalization
performance.
Tuning: The process of finding the best hyperparameters for a model, often done using techniques
like grid search or random search.
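To illustrate the tuning step described above, here is a hedged sketch (assuming scikit-learn, the Iris toy dataset, and a support vector machine; the parameter grid is purely illustrative) that uses grid search with cross-validation to pick hyperparameters:

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Candidate hyperparameter values to try (illustrative choices)
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}

# Grid search evaluates each combination with 5-fold cross-validation
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)

print("Best hyperparameters:", search.best_params_)
print("Best cross-validated accuracy:", search.best_score_)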
These concepts form the foundation of machine learning, guiding how algorithms are developed and
applied to solve real-world problems.
MACHINE LEARNING LIFE CYCLE
The machine learning life cycle involves seven major steps, which are given below:
Gathering Data
Data preparation
Data Wrangling
Analyse Data
Train the model
Test the model
Deployment
The most important thing in the complete process is to understand the problem and to know the
purpose of the problem. Therefore, before starting the life cycle, we need to understand the problem,
because a good result depends on a good understanding of the problem.
In the complete life cycle process, to solve a problem, we create a machine learning system called a
"model", and this model is created by providing "training". But to train a model, we need data; hence,
the life cycle starts with collecting data.
Gathering Data:
Data gathering is the first step of the machine learning life cycle. The goal of this step is to
identify and obtain the data needed for the problem.
In this step, we need to identify the different data sources, as data can be collected from
various sources such as files, databases, the internet, or mobile devices. It is one of the most important
steps of the life cycle. The quantity and quality of the collected data determine the efficiency of the
output: the more data we have, the more accurate the prediction will be.
By performing the above task, we get a coherent set of data, also called a dataset, which will be used
in further steps.
Data preparation
After collecting the data, we need to prepare it for the further steps. Data preparation is the step where
we put our data into a suitable place and prepare it for use in machine learning training.
In this step, we first put all the data together and then randomize the ordering of the data.
Data exploration:
It is used to understand the nature of data that we have to work with. We need to understand the
characteristics, format, and quality of data.
A better understanding of data leads to an effective outcome. In this, we find Correlations, general
trends, and outliers.
Data pre-processing:
The next step is pre-processing of the data so that it is ready for analysis.
Data Wrangling
Data wrangling is the process of cleaning and converting raw data into a usable format. It is
the process of cleaning the data, selecting the variables to use, and transforming the data into a proper
format to make it more suitable for analysis in the next step. It is one of the most important steps of
the complete process. Cleaning of data is required to address quality issues.
The data we have collected is not always useful as it is, since some of the data may
not be relevant. In real-world applications, collected data may have various issues, including:
Missing Values
Duplicate data
Invalid data
Noise
It is mandatory to detect and remove the above issues because it can negatively affect the quality of
the outcome.
Data Analysis
Now the cleaned and prepared data is passed on to the analysis step. This step involves:
Selection of analytical techniques
Building models
Reviewing the result
The aim of this step is to build a machine learning model to analyze the data using various
analytical techniques and review the outcome. It starts with determining the type of problem,
where we select machine learning techniques such as Classification, Regression, Cluster
analysis, Association, etc.; we then build the model using the prepared data and evaluate it.
Hence, in this step, we take the data and use machine learning algorithms to build the model.
Train Model
Now the next step is to train the model. In this step, we train the model to improve its
performance and obtain a better outcome for the problem.
We use datasets to train the model using various machine learning algorithms. Training a
model is required so that it can understand the various patterns, rules, and features.
Test Model
Once our machine learning model has been trained on a given dataset, we test the
model. In this step, we check the accuracy of our model by providing a test dataset to it.
Testing the model determines the percentage accuracy of the model as per the requirements
of the project or problem.
Deployment
The last step of machine learning life cycle is deployment, where we deploy the model in the
real-world system.
If the prepared model produces accurate results as per our requirements with
acceptable speed, then we deploy the model in the real system. But before deploying the project, we
check whether the model keeps performing well on the available data. The deployment phase
is similar to making the final report for a project.
WEIGHT SPACE
In the context of machine learning, "weight space" refers to the multidimensional space defined by
the weights of a model, particularly in neural networks. Each point in this space represents a specific
configuration of the model's parameters, and the objective of training is to find an optimal point in
this space that minimizes the loss function (i.e., the error between the model's predictions and the
actual outcomes).
Key Concepts:
Dimensionality: The dimensionality of the weight space is determined by the number of parameters
(weights) in the model. For example, if a neural network has 10,000 parameters, the weight space is
a 10,000-dimensional space.
Loss Surface: The loss function, when plotted over the weight space, forms a surface known as the
loss surface. The goal of training is to find the lowest point on this surface, corresponding to the
minimum loss.
Gradient Descent: This is a common optimization technique used to navigate the weight space. The
algorithm iteratively adjusts the weights by moving in the direction of the negative gradient of the loss
function to reach a minimum (a small sketch is given at the end of this section).
Local vs. Global Minima: The loss surface can have multiple minima. A global minimum is the absolute
lowest point on the loss surface, while local minima are points where the loss is lower than at the
surrounding points but not the lowest overall.
Flatness and Sharpness: In weight space, the nature of the minima can vary. Flat minima are regions
where small changes in the weights do not significantly increase the loss, which often correlates with
better generalization. Sharp minima are regions where small changes can lead to a significant increase
in loss, often leading to poorer generalization.
Visualizing weight space is challenging due to its high dimensionality, but techniques like PCA (Principal
Component Analysis) can reduce the dimensionality to visualize important structures, such as the path
the optimizer takes during training.
Weight space is crucial for understanding how neural networks learn and how different configurations
of weights affect the performance of the model.
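To make the gradient descent idea described above concrete, here is a minimal hedged sketch in plain Python with a made-up one-dimensional "weight space": the loss is a simple quadratic, and each step moves the weight in the direction of the negative gradient. The loss function and learning rate are illustrative choices, not from the notes.

# A toy loss surface over a one-dimensional weight space: L(w) = (w - 3)^2
def loss(w):
    return (w - 3.0) ** 2

def gradient(w):
    return 2.0 * (w - 3.0)   # derivative of the loss with respect to w

w = 10.0            # starting point in weight space
learning_rate = 0.1

# Each iteration moves w in the direction of the negative gradient
for step in range(50):
    w = w - learning_rate * gradient(w)

print("Final weight:", round(w, 4), "Loss:", round(loss(w), 6))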
Data Splitting
Training Set: The subset of data used to train the model. The model learns the patterns in this data.
Validation Set: A subset of data used to tune model parameters. It helps prevent overfitting by
validating the model’s performance during training.
Test Set: A separate subset of data used to evaluate the model’s performance after it has been trained.
This set should only be used once to assess the model's final performance.
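As a hedged illustration of splitting data (assuming scikit-learn and a synthetic dataset; the 60/20/20 proportions are just an example), the following sketch carves a dataset into training, validation, and test sets:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# First split off the test set (20% of the data)
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Then split the remainder into training (60%) and validation (20%) sets
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))   # 600 200 200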
Cross-validation
Cross-validation is a technique used to assess the performance and generalizability of a
machine learning model. It involves partitioning the data into subsets, training the model on some
subsets while testing it on the remaining subsets, and then repeating this process multiple times to
ensure that the model performs well across different portions of the data. This helps in avoiding
overfitting and provides a more accurate estimate of a model’s performance on unseen data.
Assess Model Performance: It provides a robust estimate of the model's performance by training and
testing it on different splits of the data.
Avoid Overfitting: Helps ensure that the model generalizes well to new, unseen data and is not just
performing well on the specific data it was trained on.
Efficient Use of Data: Allows for the efficient use of the entire dataset for both training and validation.
There are many types of cross-validation techniques available. Among these, k-fold cross-validation
is particularly important.
K-fold cross-validation
K-fold cross-validation is a robust method for evaluating the performance of a machine
learning model. It helps ensure that the model's performance is not dependent on a particular train-
test split, providing a more generalizable measure of model accuracy. The procedure is explained
below, followed by an example code sketch after the considerations:
Split: The dataset is divided into k equal-sized folds.
Train and Validate: The model is trained on k-1 folds and validated on the remaining fold.
Repeat: This process is repeated k times, each time with a different fold as the validation set.
Average Performance: The performance metric (e.g., accuracy, precision, recall) is averaged across all
k trials to provide a more robust estimate of the model's performance.
Efficient Use of Data: Utilizes the entire dataset for both training and validation, making the most of
available data.
Less Bias: Reduces the bias associated with a single train-test split.
Considerations
Choice of k: Common choices are 5 or 10, but this depends on the dataset size and the specific use
case.
Computation Cost: Increases computational cost by training the model k times, which can be
significant for large datasets or complex models.
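The example code referred to above is sketched here in Python, assuming scikit-learn, a synthetic dataset, and logistic regression as the model; these are illustrative choices, not prescribed by the notes.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=500, n_features=8, random_state=1)
model = LogisticRegression(max_iter=1000)

# 5-fold cross-validation: train on 4 folds, validate on the remaining fold,
# repeat 5 times, and average the accuracy scores
kfold = KFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(model, X, y, cv=kfold, scoring="accuracy")

print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())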
Confusion Matrix
A confusion matrix is a table used to evaluate the performance of a classification model by
summarizing the counts of correct and incorrect predictions for each class. It provides a detailed
breakdown of how well the model is performing, revealing not only the accuracy but also the types of
errors being made.
Structure
For a binary classification problem, the confusion matrix is a 2x2 table with the following
structure:

                        Predicted Positive        Predicted Negative
Actual Positive         True Positive (TP)        False Negative (FN)
Actual Negative         False Positive (FP)       True Negative (TN)
Definitions
True Positive (TP): The number of instances correctly predicted as positive.
False Positive (FP): The number of instances incorrectly predicted as positive (type I error).
True Negative (TN): The number of instances correctly predicted as negative.
False Negative (FN): The number of instances incorrectly predicted as negative (type II error).
Recall
Recall, also known as sensitivity or true positive rate, is a performance metric for classification models.
It measures the ability of the model to identify all relevant instances in a dataset, specifically focusing
on the proportion of actual positive instances that are correctly identified by the model.
F1 Score
The F1 score is a performance metric for classification models that combines precision and recall into
a single metric. It is the harmonic mean of precision and recall, providing a balance between the two
metrics. The F1 score is particularly useful when you need to account for both false positives and false
negatives and when dealing with imbalanced datasets.
These metrics are defined as:
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 Score = 2 × (Precision × Recall) / (Precision + Recall)
where:
TP (True Positives): The number of actual positive instances that were correctly predicted as positive.
FP (False Positives): The number of actual negative instances that were incorrectly predicted as
positive.
FN (False Negatives): The number of actual positive instances that were incorrectly predicted as
negative.
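The following hedged Python sketch (assuming scikit-learn and small made-up label vectors, which are not from these notes) computes the confusion matrix and the metrics defined above:

from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Hypothetical true labels and model predictions for a binary problem
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP:", tp, "FP:", fp, "TN:", tn, "FN:", fn)

print("Precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("Recall:   ", recall_score(y_true, y_pred))     # TP / (TP + FN)
print("F1 score: ", f1_score(y_true, y_pred))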
ROC AUC
The ROC AUC (Receiver Operating Characteristic - Area Under the Curve) is a performance
metric for evaluating the performance of binary classification models. The ROC curve plots the True
Positive Rate (TPR, also known as recall) against the False Positive Rate (FPR) at various threshold
settings. The AUC (Area Under the Curve) summarizes the ROC curve in a single number, representing
the likelihood that the model will rank a randomly chosen positive instance higher than a randomly
chosen negative instance.
ROC Curve
The ROC curve is a graphical representation of the trade-offs between TPR and FPR across
different classification thresholds. An ideal model will have a ROC curve that passes close to the top-
left corner of the plot, where TPR is 1 and FPR is 0.
AUC
The AUC represents the overall ability of the model to discriminate between positive and
negative classes. The value of the AUC ranges from 0 to 1:
AUC = 1: A perfect model that ranks every positive instance higher than every negative instance.
AUC = 0.5: No discriminative ability, equivalent to random guessing.
AUC < 0.5: Worse than random guessing, typically an indicator of a flawed model.
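As a hedged sketch (assuming scikit-learn and made-up labels and predicted probabilities), the ROC curve points and the AUC can be computed as follows:

from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical true labels and predicted probabilities of the positive class
y_true  = [0, 0, 1, 1, 0, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.55, 0.7]

# TPR and FPR at each threshold define the ROC curve
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print("FPR:", fpr)
print("TPR:", tpr)

# AUC summarizes the curve in a single number between 0 and 1
print("ROC AUC:", roc_auc_score(y_true, y_score))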
A BRIEF REVIEW OF PROBABILITY THEORY
Probability theory is a branch of mathematics concerned with the analysis of random
phenomena. It provides a framework for quantifying the uncertainty inherent in various processes
and events. Here’s a brief overview of the key concepts in probability theory:
Basic Concepts
Experiment: An action or process that leads to one or more possible outcomes. For example,
rolling a die or flipping a coin.
Sample Space (S): The set of all possible outcomes of an experiment. For example, for a die
roll, S = {1, 2, 3, 4, 5, 6}.
Event (E): A subset of the sample space. An event consists of one or more outcomes. For
example, rolling an even number on a die (E = {2, 4, 6}).
Probability (P): A measure of the likelihood that an event will occur. The probability of an
event E is denoted by P(E) .It ranges from 0 (impossible event) to 1 (certain event).
Random Variables
A random variable is a function that assigns a real number to each outcome in the sample
space. There are two main types of random variables:
Discrete Random Variables: Take on a countable number of distinct values. For example, the number
of heads in 10 coin flips.
Continuous Random Variables: Take on an infinite number of possible values within a given range.
For example, the exact height of students in a class.
Probability Distributions
A probability distribution describes how the probabilities are distributed over the values of the
random variable.
Probability Mass Function (PMF): For discrete random variables, the PMF gives the probability that
the random variable takes a specific value.
Probability Density Function (PDF): For continuous random variables, the PDF describes the likelihood
of the random variable taking a value within a given range.
Cumulative Distribution Function (CDF): Gives the probability that the random variable is less than or
equal to a certain value.
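As a hedged illustration of the PMF, PDF, and CDF (assuming the SciPy library, an illustrative choice), the sketch below uses a fair six-sided die for the discrete case and a standard normal distribution for the continuous case:

from scipy import stats

# Discrete case: a fair six-sided die, modelled as a uniform distribution over 1..6
die = stats.randint(1, 7)           # integers 1 to 6
print("P(roll = 4):", die.pmf(4))   # PMF at a specific value
print("P(roll <= 4):", die.cdf(4))  # CDF: probability of a value <= 4

# Continuous case: standard normal distribution
normal = stats.norm(loc=0, scale=1)
print("PDF at x = 0:", normal.pdf(0))
print("P(X <= 1.96):", normal.cdf(1.96))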
TURNING DATA INTO PROBABILITIES
Turning data into probabilities is a crucial aspect of machine learning, particularly in probabilistic
models, classification, and prediction tasks. Here’s a guide on how to do it:
2. Data Preprocessing
Preprocessing the data involves cleaning, transforming, and organizing it to be suitable for analysis.
This can include:
Encoding Categorical Data: Converting categorical data into numerical form using techniques like one-
hot encoding.
3. Estimating Probabilities
1. Frequency-Based Methods
For categorical data, probabilities can be estimated using frequency counts. For example, in a
dataset of emails, the probability of an email being spam can be estimated by the proportion of spam
emails.
2. Density Estimation
For continuous data, probability density functions (PDFs) can be estimated using techniques
like:
Histogram: Dividing the data into bins and estimating the density by the height of the histogram.
Kernel Density Estimation (KDE): Smoothing the data using a kernel function to estimate the PDF.
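A hedged sketch of the two density-estimation approaches above (assuming NumPy and SciPy, with synthetic data drawn for illustration):

import numpy as np
from scipy import stats

# Synthetic continuous data drawn from a normal distribution
rng = np.random.RandomState(0)
data = rng.normal(loc=10.0, scale=2.0, size=500)

# Histogram-based estimate: bin counts normalised to a density
density, bin_edges = np.histogram(data, bins=20, density=True)
print("First few histogram densities:", density[:5])

# Kernel density estimate: a smooth estimate of the PDF
kde = stats.gaussian_kde(data)
print("Estimated density at x = 10:", kde(10.0)[0])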
3. Parametric Methods
Using parametric models involves assuming that the data follows a certain distribution (e.g., normal
distribution) and estimating the parameters of that distribution.
For example, if the data are assumed to follow a normal distribution, the probability density at a data
point x can be estimated using the normal distribution formula:
f(x) = (1 / (σ √(2π))) · exp( −(x − μ)² / (2σ²) )
where μ is the mean and σ is the standard deviation estimated from the data.
4. Probabilistic Models
1. Naive Bayes
Naive Bayes is a simple yet effective probabilistic classifier based on Bayes' theorem. It assumes
independence between the features:
P(y | x1, ..., xn) ∝ P(y) · P(x1 | y) · P(x2 | y) · ... · P(xn | y)
where P(y) is the prior probability of class y and P(xi | y) is the likelihood of feature xi given class y.
Despite the independence assumption, Naive Bayes often performs well in practice.
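A minimal hedged sketch of a Naive Bayes classifier (assuming scikit-learn's GaussianNB and the Iris toy dataset, both illustrative choices); the model returns class probabilities for new points:

from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

# Gaussian Naive Bayes assumes each feature is normally distributed per class
model = GaussianNB()
model.fit(X, y)

# Predicted class probabilities and class for the first sample
print("Class probabilities:", model.predict_proba(X[:1]))
print("Predicted class:", model.predict(X[:1]))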
2. Logistic Regression
Logistic regression is used for binary classification and estimates the probability of the binary outcome
using the logistic function.
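Similarly, here is a hedged sketch of logistic regression returning probabilities (assuming scikit-learn and a synthetic binary dataset, both illustrative choices):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=5, random_state=7)

# Logistic regression models P(y = 1 | x) with the logistic (sigmoid) function
model = LogisticRegression(max_iter=1000)
model.fit(X, y)

# Probability of each class for one example point
print("P(class 0), P(class 1):", model.predict_proba(X[:1])[0])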
3. Bayesian Networks
Bayesian networks represent a set of variables and their conditional dependencies using a directed
acyclic graph (DAG). They are used to model complex probabilistic relationships.
5. Model Training and Evaluation
Evaluation: Assess the performance of the model using metrics like log-likelihood, cross-entropy loss,
and accuracy. Use techniques like cross-validation to ensure the model generalizes well to unseen
data.
6. Making Predictions
Once the model is trained, it can be used to make probabilistic predictions. For example, given a new
data point, the model can estimate the probability of different outcomes (e.g., the probability that an
email is spam).
7. Practical Considerations
Model Assumptions: Ensure the assumptions of the chosen model are reasonable for your data.
Interpretability: Probabilistic models can provide interpretable outputs, such as the probability of a
class, which can be useful for decision-making.
By carefully preprocessing the data, choosing appropriate probabilistic models, and rigorously
training and evaluating these models, you can effectively turn data into meaningful probabilities in
machine learning tasks.
Bias: Bias refers to the error introduced by approximating a real-world problem, which may be
complex, by a simplified model. High bias means that the model makes strong assumptions about the
data, potentially oversimplifying it. This can lead to underfitting, where the model fails to capture
important patterns in the data.
Variance: Variance refers to the error introduced by the model's sensitivity to small fluctuations in the
training data. High variance means that the model is too complex and fits the training data too closely,
capturing noise as if it were a real pattern. This can lead to overfitting, where the model performs well
on training data but poorly on unseen data.
The tradeoff arises because decreasing bias typically increases variance, and vice versa. The
goal is to find a balance that minimizes the total error, which is the sum of bias and variance errors,
along with the irreducible error (error due to noise in the data that cannot be eliminated by any
model).
Illustration
High Bias, Low Variance: A simple model (e.g., linear regression with few features) that doesn't
capture the underlying complexity of the data.
Low Bias, High Variance: A complex model (e.g., a deep neural network with many parameters) that
captures noise and fluctuations in the training data.
Underfitting (High Bias): The model is too simple, leading to high error on both the training and test
data.
Overfitting (High Variance): The model is too complex, leading to low training error but high test error.
Several techniques help manage the tradeoff:
Model Selection: Choose a model with the appropriate level of complexity for the given data.
Regularization: Techniques like L1 (Lasso) and L2 (Ridge) regularization add constraints to the model
parameters, preventing overfitting.
Cross-Validation: Use cross-validation techniques to assess model performance and avoid overfitting
by evaluating the model on multiple subsets of the data.
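To illustrate the regularization techniques mentioned above, here is a hedged sketch (assuming scikit-learn and a synthetic regression dataset; the alpha values are illustrative) comparing L2 (Ridge) and L1 (Lasso) regularized linear models:

from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso

X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=3)

# L2 regularization shrinks coefficients towards zero
ridge = Ridge(alpha=1.0).fit(X, y)

# L1 regularization can drive some coefficients exactly to zero
lasso = Lasso(alpha=1.0).fit(X, y)

print("Ridge coefficients:", ridge.coef_.round(2))
print("Lasso coefficients:", lasso.coef_.round(2))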
FIND-S ALGORITHM
The Find-S algorithm is a basic concept learning algorithm in machine learning. The Find-S algorithm
finds the most specific hypothesis that fits all the positive examples. Note that the
algorithm considers only the positive training examples. The Find-S algorithm starts with the most
specific hypothesis and generalizes this hypothesis each time it fails to classify an observed positive
training example. Hence, the Find-S algorithm moves from the most specific hypothesis towards more
general hypotheses.
Important Representation :
? indicates that any value is acceptable for the attribute.
A specific value (e.g., Cold) indicates that only that single value is acceptable for the attribute.
ϕ indicates that no value is acceptable.
The most general hypothesis is represented by: {?, ?, ?, ?, ?, ?}
The most specific hypothesis is represented by: {ϕ, ϕ, ϕ, ϕ, ϕ, ϕ}
1. h = {ϕ, ϕ, ϕ, ϕ, ϕ, ϕ}
2. Take the next example and if it is negative, then no changes occur to the hypothesis.
3. If the example is positive and we find that our initial hypothesis is too specific then we update
our current hypothesis to a general condition.
4. Keep repeating the above steps till all the training examples are complete.
5. After we have completed all the training examples, we will have the final hypothesis, which can be
used to classify the new examples.
Example :
Consider the following data set having the data about which particular seeds are poisonous.
First, we initialize the hypothesis to the most specific hypothesis. Since the data set has four
attributes, our hypothesis is:
h = {ϕ, ϕ, ϕ, ϕ}
Consider example 1 :
The data in example 1 is { GREEN, HARD, NO, WRINKLED } and it is a positive example. We see that our
initial hypothesis is more specific and we have to generalize it for this example. Hence, the hypothesis
becomes:
h = { GREEN, HARD, NO, WRINKLED }
Consider example 2 :
Here we see that this example has a negative outcome. Hence we neglect this example and our
hypothesis remains the same.
Consider example 3 :
Here we see that this example has a negative outcome. Hence we neglect this example and our
hypothesis remains the same.
Consider example 4 :
The data present in example 4 is { ORANGE, HARD, NO, WRINKLED } and it is a positive example. We
compare every single attribute with the current hypothesis and if any mismatch is found we replace
that particular attribute with the general case ( " ? " ). After doing this, the hypothesis becomes:
h = { ?, HARD, NO, WRINKLED }
Consider example 5 :
The data present in example 5 is { GREEN, SOFT, YES, SMOOTH } and it is a positive example. We
compare every single attribute with the current hypothesis and if any mismatch is found we replace
that particular attribute with the general case ( " ? " ). After doing this, the hypothesis becomes:
h = { ?, ?, ?, ? }
Since we have reached a point where all the attributes in our hypothesis have the general condition,
example 6 and example 7 would result in the same hypothesis with all general attributes.
h = { ?, ?, ?, ? }
Final Hypothesis: h = { ?, ?, ?, ? }
Algorithm:
1. Initialize h to the most specific hypothesis in H.
2. For each positive training example x:
   For each attribute constraint a_i in h:
      If the constraint a_i is satisfied by x, then do nothing.
      Else replace a_i in h by the next more general constraint that is satisfied by x.
3. Output hypothesis h.
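A hedged Python sketch of the Find-S procedure above, using only the positive examples listed in the walkthrough (Find-S ignores negative examples); the function and variable names are my own:

# Find-S: start from the most specific hypothesis and generalize it
# just enough to cover every positive training example.
def find_s(positive_examples):
    # Most specific hypothesis: one 'phi' (no value acceptable) per attribute
    hypothesis = ['phi'] * len(positive_examples[0])
    for example in positive_examples:
        for i, value in enumerate(example):
            if hypothesis[i] == 'phi':        # first positive example seen
                hypothesis[i] = value
            elif hypothesis[i] != value:      # mismatch -> generalize to '?'
                hypothesis[i] = '?'
    return hypothesis

# Positive examples from the walkthrough above (four attributes per example)
positives = [
    ['GREEN',  'HARD', 'NO',  'WRINKLED'],   # example 1
    ['ORANGE', 'HARD', 'NO',  'WRINKLED'],   # example 4
    ['GREEN',  'SOFT', 'YES', 'SMOOTH'],     # example 5
]

print(find_s(positives))   # expected: ['?', '?', '?', '?']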
VERSION SPACE AND CANDIDATE ELIMINATION ALGORITHM
A version space is a concept in machine learning that represents the set of all hypotheses
consistent with the observed training examples. Specifically, it is the subset of the hypothesis space
that remains plausible after observing the training data. The version space is bounded by two sets of
hypotheses:
Specific Boundary (S): This is the set of the most specific hypotheses that are consistent with all
positive examples. A hypothesis is in the specific boundary if it describes as narrowly as possible all
the observed positive examples without including any negative examples.
General Boundary (G): This is the set of the most general hypotheses that are consistent with all
positive examples and exclude all negative examples. A hypothesis is in the general boundary if it
describes as broadly as possible the positive examples while excluding the negative ones.
The candidate elimination algorithm incrementally builds the version space given a hypothesis
space H and a set E of examples. The examples are added one by one; each example possibly shrinks
the version space by removing the hypotheses that are inconsistent with the example. The candidate
elimination algorithm does this by updating the general and specific boundary for each new example.
Terms Used:
Concept learning: The learning task of the machine, i.e., learning a concept from the training
data.
General Hypothesis: Does not specify the features; it is the most general form,
G = {'?', '?', '?', '?', ...}, with one '?' per attribute.
Specific Hypothesis: Specifies the features with particular values,
S = {'ϕ', 'ϕ', 'ϕ', ...}, where the number of entries depends on the number of attributes.
Version Space: It lies between the general hypothesis and the specific hypothesis. It contains not just
one hypothesis but the set of all possible hypotheses consistent with the training data set.
Algorithm:
Step 1: Load the data set.
Step 2: Initialize the general hypothesis G and the specific hypothesis S.
Step 3: For each training example:
   If the example is positive:
      for each attribute:
         if attribute_value == hypothesis_value:
            Do nothing
         else:
            replace the attribute value in S with '?' (generalize S), and remove from G any
            hypothesis that is inconsistent with the example.
   If the example is negative:
      make the general hypotheses in G more specific so that they exclude the example, and
      remove from S any hypothesis that is inconsistent with the example.
Step 4: Output the boundary sets S and G.
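Below is a hedged, simplified Python sketch of the candidate elimination idea for conjunctive hypotheses. It is my own minimal implementation, not the exact pseudo-code from the notes: S is kept as a single most-specific hypothesis and G as a set of maximally general hypotheses, and the tiny data set at the end is made up for illustration.

def covers(h, x):
    # A hypothesis covers an example if every attribute matches or is '?'
    return all(hv == '?' or hv == xv for hv, xv in zip(h, x))

def candidate_elimination(examples, labels):
    n = len(examples[0])
    S = None                      # most specific boundary (single hypothesis here)
    G = [tuple(['?'] * n)]        # most general boundary

    for x, label in zip(examples, labels):
        if label == 'positive':
            # Drop general hypotheses that fail to cover the positive example
            G = [g for g in G if covers(g, x)]
            # Minimally generalize S so that it covers the example
            if S is None:
                S = list(x)
            else:
                S = [s if s == v else '?' for s, v in zip(S, x)]
        else:
            # Minimally specialize every general hypothesis that wrongly covers
            # the negative example, using values from S
            new_G = []
            for g in G:
                if not covers(g, x):
                    new_G.append(g)
                    continue
                for i in range(n):
                    if g[i] == '?' and S is not None and S[i] != '?' and S[i] != x[i]:
                        spec = list(g)
                        spec[i] = S[i]
                        new_G.append(tuple(spec))
            G = new_G
    return S, G

# Tiny illustrative data set (two attributes, made-up values)
examples = [('GREEN', 'HARD'), ('ORANGE', 'HARD'), ('GREEN', 'SOFT')]
labels   = ['positive', 'positive', 'negative']
S, G = candidate_elimination(examples, labels)
print("S:", S)   # most specific consistent hypothesis
print("G:", G)   # most general consistent hypotheses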
The Candidate Elimination Algorithm (CEA) is an improvement over the Find-S algorithm for
classification tasks. While CEA shares some similarities with Find-S, it also has some essential
differences that offer advantages and disadvantages.
Advantages of CEA over Find-S:
It considers both positive and negative examples, whereas Find-S uses only the positive examples.
It produces the complete version space (all consistent hypotheses, bounded by S and G) rather than a
single most specific hypothesis.
Disadvantages of CEA in comparison with Find-S:
More complex: CEA is a more complex algorithm than Find-S, which may make it more difficult
for beginners or those without a strong background in machine learning to use and
understand.
Higher memory requirements: CEA requires more memory to store the set of hypotheses and
boundaries, which may make it less suitable for memory-constrained environments.
Slower processing for large datasets: CEA may become slower for larger datasets due to the
increased number of hypotheses generated.
Higher potential for overfitting: The increased complexity of CEA may make it more prone to
overfitting on the training data, especially if the dataset is small or has a high degree of noise.
Statistical Concepts
Mean (Average): The sum of all values divided by the number of values. It represents the central
tendency of the data.
Median: The middle value in a data set when the values are arranged in ascending order. It is a
measure of central tendency that is less affected by outliers.
Mode: The value that appears most frequently in a data set. A data set can have one mode, more than
one mode, or no mode.
Variance: A measure of how much the values in a data set deviate from the mean. It is calculated as
the average of the squared differences from the mean.
Standard Deviation: The square root of the variance. It provides a measure of the spread or dispersion
of a set of data.
Range: The difference between the maximum and minimum values in a data set.
Skewness: A measure of the asymmetry of the distribution of values in a data set. Positive skewness
indicates a distribution with a longer tail on the right, while negative skewness indicates a longer tail
on the left.
Kurtosis: A measure of the "tailedness" of the distribution. High kurtosis means more data is in the
tails, while low kurtosis means less data is in the tails.
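A hedged Python sketch (assuming NumPy and SciPy, with a small made-up sample) computing the statistics defined above:

import numpy as np
from scipy import stats

data = np.array([2, 4, 4, 4, 5, 5, 7, 9])    # small illustrative sample

print("Mean:    ", np.mean(data))
print("Median:  ", np.median(data))
print("Mode:    ", stats.mode(data).mode)
print("Variance:", np.var(data))             # population variance
print("Std dev: ", np.std(data))
print("Range:   ", np.max(data) - np.min(data))
print("Skewness:", stats.skew(data))
print("Kurtosis:", stats.kurtosis(data))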