Machine Learning - UNIT I Notes

UNIT – I

INTRODUCTION
Machine Learning

Machine learning (ML) is a branch of artificial intelligence (AI) and computer science that
focuses on using data and algorithms to enable AI to imitate the way that humans learn, gradually
improving its accuracy.

In other words, machine learning is a subarea of artificial intelligence: the study of algorithms
that give computers the ability to learn and make decisions based on data rather than explicit
instructions. A popular example is learning to predict whether an email is spam or not spam from
many examples of these two types of email.

Types of machine learning

Machine learning is broadly categorized into several types based on the nature of the learning process
and the type of data used. The primary types include:

 Supervised Learning: Learning with labeled data.


 Unsupervised Learning: Learning with unlabeled data.
 Semi-Supervised Learning: Learning with a combination of labeled and unlabeled data.
 Reinforcement Learning: Learning through interaction with an environment and receiving
feedback.
 Deep Learning: Using deep neural networks for complex pattern recognition.
 Transfer Learning: Applying knowledge gained from one problem to a different but related
problem.

SUPERVISED LEARNING
Supervised learning is the type of machine learning in which machines are trained using well
"labelled" training data, and on the basis of that data, machines predict the output. Labelled data
means that some input data is already tagged with the correct output.

In supervised learning, the training data provided to the machines works as the supervisor that
teaches the machines to predict the output correctly. It applies the same concept as a student
learning under the supervision of a teacher.

Supervised learning is a process of providing input data as well as correct output data to the
machine learning model. The aim of a supervised learning algorithm is to find a mapping function
that maps the input variable (x) to the output variable (y).

In the real world, supervised learning can be used for risk assessment, image classification,
fraud detection, spam filtering, etc.

How Supervised Learning Works?

In supervised learning, models are trained using a labelled dataset, where the model learns
about each type of data. Once the training process is completed, the model is tested on
test data and then predicts the output.

The working of supervised learning can be easily understood from the following example:
Suppose we have a dataset of different types of shapes which includes square, rectangle,
triangle, and Polygon. Now the first step is that we need to train the model for each shape.

 If the given shape has four sides, and all the sides are equal, then it will be labelled as a Square.
 If the given shape has three sides, then it will be labelled as a triangle.
 If the given shape has six equal sides, then it will be labelled as a hexagon.

Now, after training, we test our model using the test set, and the task of the model is to identify
the shape. The machine is already trained on all types of shapes, and when it finds a new shape, it
classifies the shape on the basis of its number of sides, and predicts the output.

Key Concepts
Training Data: A dataset containing input-output pairs, where each input (feature set) is associated
with an output (label).

Features: The attributes or properties of the data used as input for the model.

Labels: The output values or categories that the model aims to predict.

Model: The function or algorithm that is trained to make predictions.

Loss Function: A mathematical function that measures the difference between the predicted output
and the actual output. The goal is to minimize this difference.

Optimization Algorithm: A method used to minimize the loss function by adjusting the model's
parameters (e.g., weights in a neural network).

Steps Involved in Supervised Learning:


 First, determine the type of training dataset.
 Collect/gather the labelled training data.
 Split the dataset into a training set, a test set, and a validation set (see the sketch after
this list).
 Determine the input features of the training dataset, which should carry enough information
so that the model can accurately predict the output.
 Determine a suitable algorithm for the model, such as a support vector machine, decision
tree, etc.
 Execute the algorithm on the training dataset. Sometimes we need a validation set to tune the
control parameters; it is a subset of the training dataset.
 Evaluate the accuracy of the model by providing the test set. If the model predicts the correct
outputs, our model is accurate.
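
For instance, the splitting step can be sketched with scikit-learn's train_test_split (a minimal
sketch, assuming scikit-learn is available; the 60/20/20 ratio and the random data are illustrative
choices only):

import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative labelled data: 100 samples, 4 features, binary labels
X = np.random.rand(100, 4)
y = np.random.randint(0, 2, size=100)

# Hold out 20% of the data as the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Split a further 25% of the remainder off as the validation set (0.25 * 0.8 = 0.2 overall)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20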

Types of Supervised Learning

Classification: The task of predicting a discrete label or category for an input. The output is typically a
finite set of classes.

 Binary Classification: Predicts one of two possible classes (e.g., spam or not spam).
 Multiclass Classification: Predicts one of more than two classes (e.g., classifying types of
animals in images).

Regression: The task of predicting a continuous value for an input. The output is a real number, and
the goal is to predict quantities. Examples: Predicting house prices, estimating temperatures,
forecasting stock prices.

Common Algorithms
Linear Regression: Models the relationship between input features and a continuous output using a
linear function.

Logistic Regression: Used for binary classification problems; it models the probability of an instance
belonging to a particular class.

Decision Trees: A model that splits the data into subsets based on feature values, forming a tree
structure.

Random Forests: An ensemble method that combines multiple decision trees to improve accuracy
and reduce overfitting.

Support Vector Machines (SVM): Finds the optimal hyperplane that separates classes in the feature
space.

K-Nearest Neighbors (KNN): Classifies a data point based on the majority class among its k nearest
neighbors.
Naive Bayes: A probabilistic classifier based on Bayes' theorem, assuming independence between
features.

Neural Networks: Computational models inspired by the human brain, used for both classification and
regression tasks, especially effective for complex problems.
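
As a quick illustration of two of these algorithms, the sketch below (assuming scikit-learn and its
bundled Iris dataset; the 70/30 split is an arbitrary choice) trains a decision tree and a
k-nearest-neighbours classifier on the same labelled data and compares their test accuracy:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

for model in (DecisionTreeClassifier(random_state=0), KNeighborsClassifier(n_neighbors=5)):
    model.fit(X_train, y_train)  # learn from the labelled training examples
    print(type(model).__name__, model.score(X_test, y_test))  # accuracy on unseen data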

Steps in Supervised Learning


Data Collection: Gather a labeled dataset relevant to the problem.

Data Preprocessing: Clean and prepare the data (e.g., handling missing values, normalizing features).

Feature Engineering: Select and transform features to improve model performance.

Model Selection: Choose an appropriate algorithm based on the nature of the task (classification or
regression).

Training: Train the model using the training data by optimizing the loss function.

Validation: Evaluate the model's performance using a separate validation dataset to fine-tune
hyperparameters and prevent overfitting.

Testing: Assess the final model's performance on a test dataset to evaluate its generalization ability.

Deployment: Deploy the trained model to make predictions on new, unseen data.

Applications
Healthcare: Diagnosing diseases, predicting patient outcomes.

Finance: Credit scoring, fraud detection, stock market prediction.

Marketing: Customer segmentation, predicting customer churn, personalized recommendations.

Natural Language Processing: Sentiment analysis, spam detection, language translation.

Computer Vision: Image and object recognition, facial recognition.

Challenges
Overfitting: When the model learns not only the underlying patterns but also the noise in the training
data, leading to poor performance on new data.

Underfitting: When the model is too simple to capture the underlying patterns in the data, resulting
in poor performance.

Data Quality: The quality and quantity of training data significantly affect the model's performance.

Model Interpretability: Some models, particularly complex ones like deep neural networks, can be
difficult to interpret.

Supervised learning is a foundational approach in machine learning, widely used across
various industries and applications due to its ability to predict outcomes based on historical data.
UNSUPERVISED LEARNING
In the previous topic, we learned about supervised machine learning, in which models are trained
using labeled data. But there are many cases in which we do not have labeled data and need to find
the hidden patterns in a given dataset. To solve such cases in machine learning, we need
unsupervised learning techniques.

What is Unsupervised Learning?

As the name suggests, unsupervised learning is a machine learning technique in which models
are not supervised using a labelled training dataset. Instead, the model itself finds the hidden
patterns and insights in the given data. It can be compared to the learning that takes place in the
human brain while learning new things. It can be defined as:

Unsupervised learning is a type of machine learning where the model is trained on data that
is not labeled. Unlike supervised learning, there is no explicit output associated with each input in the
training dataset. The goal of unsupervised learning is to discover hidden patterns, relationships, or
structures within the data.

Unsupervised learning cannot be directly applied to a regression or classification problem
because, unlike supervised learning, we have the input data but no corresponding output data. The
goal of unsupervised learning is to find the underlying structure of the dataset, group the data
according to similarities, and represent the dataset in a compressed format.


Example: Suppose the unsupervised learning algorithm is given an input dataset containing
images of different types of cats and dogs. The algorithm is never trained upon the given dataset,
which means it does not have any idea about the features of the dataset. The task of the unsupervised
learning algorithm is to identify the image features on their own. Unsupervised learning algorithm will
perform this task by clustering the image dataset into the groups according to similarities between
images.

Why use Unsupervised Learning?

Below are some main reasons which describe the importance of Unsupervised Learning:

 Unsupervised learning is helpful for finding useful insights from data.
 Unsupervised learning is similar to how a human learns to think from their own experiences,
which makes it closer to real AI.
 Unsupervised learning works on unlabeled and uncategorized data, which makes unsupervised
learning all the more important.
 In the real world, we do not always have input data with corresponding outputs, so to solve
such cases, we need unsupervised learning.

Working of Unsupervised Learning

The working of unsupervised learning can be understood from the following example:

Suppose we have unlabeled input data, which means it is not categorized and no
corresponding outputs are given. This unlabeled input data is fed to the machine
learning model in order to train it. First, the model interprets the raw data to find the hidden
patterns in the data and then applies a suitable algorithm such as k-means clustering,
hierarchical clustering, etc.

Once it applies the suitable algorithm, the algorithm divides the data objects into groups
according to the similarities and differences between the objects.

Types of Unsupervised Learning Algorithm:

The unsupervised learning algorithm can be further categorized into two types of problems:

Clustering: Clustering is a method of grouping objects into clusters such that objects with the most
similarities remain in one group and have few or no similarities with the objects of another group.
Cluster analysis finds the commonalities between the data objects and categorizes them as per the
presence and absence of those commonalities.

Association: Association rule mining is an unsupervised learning method used for finding the
relationships between variables in a large database. It determines the sets of items that occur
together in the dataset. Association rules make marketing strategies more effective: for example,
people who buy item X (say, bread) also tend to purchase item Y (butter or jam). A typical
application of association rules is Market Basket Analysis.

Unsupervised Learning algorithms:

Below is a list of some popular unsupervised learning algorithms (a short k-means sketch follows the list):

 K-means clustering
 Hierarchical clustering
 Anomaly detection
 Neural Networks
 Principal Component Analysis
 Independent Component Analysis
 Apriori algorithm
 Singular value decomposition
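
As a minimal sketch of the first of these (assuming scikit-learn; the three synthetic "blobs" stand
in for real unlabeled data), k-means can be run as follows:

import numpy as np
from sklearn.cluster import KMeans

# Synthetic unlabeled data: three groups of 2-D points around different centres
rng = np.random.default_rng(0)
centres = ([0, 0], [5, 5], [0, 5])
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in centres])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)  # learned group centres
print(kmeans.labels_[:10])      # cluster assignment of the first 10 points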

Key Concepts

 Unlabeled Data: The dataset used for training contains only input features without
corresponding output labels.
 Clustering: Grouping similar data points together based on their features.
 Association: Discovering interesting relationships between variables in large datasets.
 Dimensionality Reduction: Reducing the number of input variables while retaining the
essential information.

Advantages of Unsupervised Learning

 Unsupervised learning is used for more complex tasks as compared to supervised learning
because, in unsupervised learning, we don't have labeled input data.
 Unsupervised learning is preferable as it is easy to get unlabeled data in comparison to labeled
data.

Disadvantages of Unsupervised Learning

 Unsupervised learning is intrinsically more difficult than supervised learning as it does not
have corresponding output.
 The result of the unsupervised learning algorithm might be less accurate as input data is not
labeled, and algorithms do not know the exact output in advance.

BASIC CONCEPTS IN MACHINE LEARNING


Machine learning (ML) is a branch of artificial intelligence (AI) that involves the development
of algorithms that allow computers to learn from and make decisions based on data. Here are some
of the basic concepts in machine learning:
Supervised Learning
Definition: A type of machine learning where the algorithm is trained on labeled data, meaning that
each training example is paired with an output label.

Examples: Classification (e.g., spam detection in emails), regression (e.g., predicting house prices).

Common Algorithms: Linear regression, decision trees, support vector machines, neural networks.

Unsupervised Learning
Definition: In unsupervised learning, the algorithm is trained on unlabeled data and must find patterns
or structures within the data on its own.

Examples: Clustering (e.g., customer segmentation), association (e.g., market basket analysis).

Common Algorithms: K-means clustering, hierarchical clustering, principal component analysis (PCA).

Reinforcement Learning
Definition: A type of learning where an agent interacts with an environment and learns to make
decisions by receiving rewards or penalties.

Examples: Game playing (e.g., chess, Go), robotics (e.g., teaching robots to walk).

Common Algorithms: Q-learning, deep Q-networks (DQN), policy gradients.

Training and Testing


Training: The process of teaching an algorithm by feeding it data. The model learns patterns from this
data.

Testing: Evaluating the performance of the trained model on a separate dataset that it hasn’t seen
during training. This helps assess how well the model generalizes to new data.

Overfitting and Underfitting


Overfitting: When a model learns not only the underlying patterns but also the noise in the training
data, leading to poor performance on new data.

Underfitting: When a model is too simple to capture the underlying patterns in the data, resulting in
poor performance even on the training data.

Feature Engineering
Definition: The process of selecting, modifying, and creating features (input variables) that will be used
by the machine learning model.

Importance: Good feature engineering can significantly improve the performance of a model.

Model Evaluation Metrics


Accuracy: The percentage of correct predictions made by the model (used mostly in classification).

Precision and Recall: Precision is the number of true positive results divided by the number of all
predicted positive results, while recall is the number of true positive results divided by the number
of actual positives that should have been retrieved.
F1 Score: The harmonic mean of precision and recall, providing a balance between the two.

Mean Squared Error (MSE): Used in regression tasks, it measures the average squared difference
between the predicted and actual values.

Cross-Validation
Definition: A technique for evaluating the performance of a model by dividing the data into multiple
subsets and training/testing the model multiple times with different subsets.

Purpose: Helps ensure that the model’s performance is not dependent on a particular division of the
data.

Bias-Variance Tradeoff
Bias: Error due to overly simplistic models that do not capture the underlying patterns in the data.

Variance: Error due to models that are too complex and sensitive to small fluctuations in the training
data.

Tradeoff: A balance must be struck between bias and variance to achieve the best generalization
performance.

Hyperparameters and Tuning


Definition: Hyperparameters are settings that control the learning process (e.g., learning rate, number
of trees in a random forest).

Tuning: The process of finding the best hyperparameters for a model, often done using techniques
like grid search or random search.

These concepts form the foundation of machine learning, guiding how algorithms are developed and
applied to solve real-world problems.

MACHINE LEARNING PROCESS (MACHINE LEARNING LIFE CYCLE)


Machine learning gives computer systems the ability to learn automatically without being
explicitly programmed. But how does a machine learning system work? This can be described using
the machine learning life cycle. The machine learning life cycle is a cyclic process for building an
efficient machine learning project. The main purpose of the life cycle is to find a solution to the
problem or project.

Machine learning life cycle involves seven major steps, which are given below:

 Gathering Data
 Data preparation
 Data Wrangling
 Analyse Data
 Train the model
 Test the model
 Deployment
The most important thing in the complete process is to understand the problem and to know the
purpose of the problem. Therefore, before starting the life cycle, we need to understand the problem,
because a good result depends on a good understanding of the problem.

In the complete life cycle process, to solve a problem, we create a machine learning system called
a "model", and this model is created by providing "training". But to train a model, we need data;
hence, the life cycle starts by collecting data.

Gathering Data:
Data gathering is the first step of the machine learning life cycle. The goal of this step is to
identify and obtain all the data relevant to the problem.

In this step, we need to identify the different data sources, as data can be collected from
various sources such as files, databases, the internet, or mobile devices. It is one of the most
important steps of the life cycle. The quantity and quality of the collected data determine the
efficiency of the output: the more data we have, the more accurate the prediction will be.

This step includes the below tasks:

 Identify various data sources


 Collect data
 Integrate the data obtained from different sources

By performing the above tasks, we get a coherent set of data, also called a dataset, which will be
used in further steps.
Data preparation
After collecting the data, we need to prepare it for the further steps. Data preparation is the step
where we put our data into a suitable place and prepare it for use in machine learning training.

In this step, we first put all the data together and then randomize the ordering of the data.

This step can be further divided into two processes:

Data exploration:

It is used to understand the nature of data that we have to work with. We need to understand the
characteristics, format, and quality of data.

A better understanding of data leads to an effective outcome. In this, we find Correlations, general
trends, and outliers.

Data pre-processing:

Now the next step is preprocessing of data for its analysis.

Data Wrangling
Data wrangling is the process of cleaning and converting raw data into a usable format. It is
the process of cleaning the data, selecting the variables to use, and transforming the data into a
proper format to make it more suitable for analysis in the next step. It is one of the most important
steps of the complete process. Cleaning of data is required to address quality issues.

The data we have collected is not always useful, as some of it may be irrelevant. In real-world
applications, collected data may have various issues, including:

 Missing Values
 Duplicate data
 Invalid data
 Noise

So, we use various filtering techniques to clean the data.

It is mandatory to detect and remove the above issues because they can negatively affect the quality
of the outcome.

Data Analysis
Now the cleaned and prepared data is passed on to the analysis step. This step involves:

 Selection of analytical techniques


 Building models
 Review the result

The aim of this step is to build a machine learning model to analyze the data using various
analytical techniques and to review the outcome. It starts with determining the type of problem,
where we select machine learning techniques such as classification, regression, cluster analysis,
association, etc.; we then build the model using the prepared data and evaluate it.

Hence, in this step, we take the data and use machine learning algorithms to build the model.
Train Model
The next step is to train the model. In this step, we train our model to improve its
performance and obtain a better outcome for the problem.

We use datasets to train the model using various machine learning algorithms. Training a
model is required so that it can understand the various patterns, rules, and features.

Test Model
Once our machine learning model has been trained on a given dataset, then we test the
model. In this step, we check for the accuracy of our model by providing a test dataset to it.

Testing the model determines the percentage accuracy of the model as per the requirement
of project or problem.

Deployment
The last step of machine learning life cycle is deployment, where we deploy the model in the
real-world system.

If the above-prepared model is producing an accurate result as per our requirement with
acceptable speed, then we deploy the model in the real system. But before deploying the project, we
will check whether it is improving its performance using available data or not. The deployment phase
is similar to making the final report for a project.

WEIGHT SPACE
In the context of machine learning, "weight space" refers to the multidimensional space defined by
the weights of a model, particularly in neural networks. Each point in this space represents a specific
configuration of the model's parameters, and the objective of training is to find an optimal point in
this space that minimizes the loss function (i.e., the error between the model's predictions and the
actual outcomes).

Key Concepts:
Dimensionality: The dimensionality of the weight space is determined by the number of parameters
(weights) in the model. For example, if a neural network has 10,000 parameters, the weight space is
a 10,000-dimensional space.

Loss Surface: The loss function, when plotted over the weight space, forms a surface known as the
loss surface. The goal of training is to find the lowest point on this surface, corresponding to the
minimum loss.

Gradient Descent: This is a common optimization technique used to navigate the weight space. The
algorithm iteratively adjusts the weights by moving in the direction of the negative gradient of the loss
function to reach a minimum.
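
A minimal sketch of gradient descent in a two-dimensional weight space (plain NumPy; the
bowl-shaped quadratic loss here is a stand-in for a real model's loss surface):

import numpy as np

def loss(w):
    # A simple bowl-shaped loss surface with its minimum at w = (3, -2)
    return (w[0] - 3) ** 2 + (w[1] + 2) ** 2

def grad(w):
    # Analytic gradient of the loss above
    return np.array([2 * (w[0] - 3), 2 * (w[1] + 2)])

w = np.zeros(2)       # starting point in weight space
learning_rate = 0.1
for step in range(100):
    w -= learning_rate * grad(w)  # move against the gradient

print(w, loss(w))  # w approaches (3, -2) and the loss approaches 0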
Local vs. Global Minima: The weight space can have multiple minima. A global minimum is the
absolute lowest point on the loss surface, while local minima are points where the loss is lower than
at the surrounding points but not the lowest overall.

Flatness and Sharpness: In weight space, the nature of the minima can vary. Flat minima are regions
where small changes in the weights do not significantly increase the loss, which often correlates with
better generalization. Sharp minima are regions where small changes can lead to a significant increase
in loss, often leading to poorer generalization.

Visualizing Weight Space

Visualizing weight space is challenging due to its high dimensionality, but techniques like PCA (Principal
Component Analysis) can reduce the dimensionality to visualize important structures, such as the path
the optimizer takes during training.

Weight space is crucial for understanding how neural networks learn and how different configurations
of weights affect the performance of the model.

TESTING MACHINE LEARNING ALGORITHMS


Testing machine learning algorithms is a crucial step in developing and deploying models. It ensures
that the model performs well on unseen data and can generalize effectively to new situations. Here’s
an overview of the key concepts and methods involved in testing machine learning algorithms:

Data Splitting
Training Set: The subset of data used to train the model. The model learns the patterns in this data.

Validation Set: A subset of data used to tune model parameters. It helps prevent overfitting by
validating the model’s performance during training.

Test Set: A separate subset of data used to evaluate the model’s performance after it has been trained.
This set should only be used once to assess the model's final performance.

Cross-validation
Cross-validation is a technique used to assess the performance and generalizability of a
machine learning model. It involves partitioning the data into subsets, training the model on some
subsets while testing it on the remaining subsets, and then repeating this process multiple times to
ensure that the model performs well across different portions of the data. This helps in avoiding
overfitting and provides a more accurate estimate of a model’s performance on unseen data.

Why Use Cross-Validation?

Assess Model Performance: It provides a robust estimate of the model's performance by training and
testing it on different splits of the data.

Avoid Overfitting: Helps ensure that the model generalizes well to new, unseen data and is not just
performing well on the specific data it was trained on.

Efficient Use of Data: Allows for the efficient use of the entire dataset for both training and validation.
There are many types of cross-validation techniques available. Among these, k-fold
cross-validation is particularly important.

K-fold cross-validation
K-fold cross-validation is a robust method for evaluating the performance of a machine
learning model. It helps ensure that the model's performance is not dependent on a particular train-
test split, providing a more generalizable measure of model accuracy. Here's a detailed explanation
and example code for k-fold cross-validation:

How K-Fold Cross-Validation Works


Divide the Dataset: The dataset is divided into k equally (or almost equally) sized folds.

Training and Validation: For each fold:

 A model is trained using k-1 folds.


 The model is validated on the remaining fold (the one not used for training).

Repeat: This process is repeated k times, each time with a different fold as the validation set.

Average Performance: The performance metric (e.g., accuracy, precision, recall) is averaged across all
k trials to provide a more robust estimate of the model's performance.
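
A sketch of the example code referred to above (assuming scikit-learn; logistic regression on the
Iris dataset is just an illustrative model/data pairing):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# 5-fold CV: train on 4 folds, validate on the remaining fold, repeated 5 times
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=kfold)

print(scores)         # accuracy on each of the 5 validation folds
print(scores.mean())  # averaged, more robust performance estimate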

Benefits of K-Fold Cross-Validation


More Reliable Estimate: Provides a more reliable estimate of model performance by averaging results
across multiple splits.

Efficient Use of Data: Utilizes the entire dataset for both training and validation, making the most of
available data.

Less Bias: Reduces the bias associated with a single train-test split.

Considerations
Choice of k: Common choices are 5 or 10, but this depends on the dataset size and the specific use
case.

Computation Cost: Increases computational cost by training the model k times, which can be
significant for large datasets or complex models.

Confusion Matrix
A confusion matrix is a table used to evaluate the performance of a classification model by
summarizing the counts of correct and incorrect predictions for each class. It provides a detailed
breakdown of how well the model is performing, revealing not only the accuracy but also the types of
errors being made.

Structure
For a binary classification problem, the confusion matrix is a 2x2 table with the following
structure:

                     Predicted Positive      Predicted Negative
Actual Positive      True Positive (TP)      False Negative (FN)
Actual Negative      False Positive (FP)     True Negative (TN)
Definitions
 True Positive (TP): The number of instances correctly predicted as positive.
 False Positive (FP): The number of instances incorrectly predicted as positive (type I error).
 True Negative (TN): The number of instances correctly predicted as negative.
 False Negative (FN): The number of instances incorrectly predicted as negative (type II error).

Example Calculation

Consider a binary classification problem with the following outcomes:

 Actual Positive Instances: 60


 Actual Negative Instances: 40

Model predictions:

True Positives (TP): 50

False Positives (FP): 5

True Negatives (TN): 35

False Negatives (FN): 10

Performance Metrics Derived from Confusion Matrix


Accuracy
Accuracy is one of the most common metrics used to evaluate the performance of a
classification model. It measures the proportion of correctly classified instances out of the total
instances in the dataset.

From the above confusion matrix:

Accuracy = (TP + TN) / (TP + TN + FP + FN) = (50 + 35) / (50 + 35 + 5 + 10) = 85 / 100 = 0.85

So, the accuracy is 85%.


Precision
Precision is a performance metric for classification models, particularly useful in binary classification
problems. It measures the accuracy of positive predictions made by the model. In other words,
precision is the proportion of true positive predictions out of all positive predictions (both true
positives and false positives).

From the above confusion matrix:

Precision = TP / (TP + FP) = 50 / (50 + 5) = 50 / 55 ≈ 0.909

So, the precision is approximately 91%.

Recall
Recall, also known as sensitivity or true positive rate, is a performance metric for classification models.
It measures the ability of the model to identify all relevant instances in a dataset, specifically focusing
on the proportion of actual positive instances that are correctly identified by the model.

From the above confusion matrix:

Recall = TP / (TP + FN) = 50 / (50 + 10) = 50 / 60 ≈ 0.833

So, the recall is approximately 83%.

F1 Score
The F1 score is a performance metric for classification models that combines precision and recall into
a single metric. It is the harmonic mean of precision and recall, providing a balance between the two
metrics. The F1 score is particularly useful when you need to account for both false positives and false
negatives and when dealing with imbalanced datasets.

From the above confusion matrix:

F1 = 2 × (Precision × Recall) / (Precision + Recall) = 2 × (0.909 × 0.833) / (0.909 + 0.833) ≈ 0.87

So, the F1 score is approximately 0.87.


Specificity
Specificity, also known as the true negative rate, is a performance metric for classification
models, particularly in binary classification. It measures the proportion of actual negative instances
that are correctly identified by the model. In other words, it evaluates how well the model identifies
negative cases.

Specificity = TN / (TN + FP)

Where:

 TN (True Negatives): The number of correct negative predictions.


 FP (False Positives): The number of incorrect positive predictions.

From the above confusion matrix:

Specificity = 35 / (35 + 5) = 0.875

So, the specificity is 0.875, or 87.5%.

True Positive Rate


The True Positive Rate (TPR), also known as sensitivity, recall, or hit rate, is a performance metric used
to evaluate the effectiveness of a classification model, particularly in binary classification tasks. TPR
measures the proportion of actual positive instances that are correctly identified by the model.

TPR = TP / (TP + FN)

Where:

 TP (True Positives): The number of correct positive predictions.


 FN (False Negatives): The number of actual positive instances that were incorrectly predicted
as negative.

From the above confusion matrix:

TPR = 50 / (50 + 10) ≈ 0.833

So, the TPR is approximately 0.833, or 83.3%.

False Positive Rate


The False Positive Rate (FPR), also known as the fall-out, is a performance metric used to evaluate the
effectiveness of a classification model, particularly in binary classification tasks. FPR measures the
proportion of actual negative instances that are incorrectly classified as positive by the model.
FPR = FP / (FP + TN)

Where:

FP (False Positives): The number of actual negative instances that were incorrectly predicted as
positive.

TN (True Negatives): The number of correct negative predictions.

From the above confusion matrix:

FPR = 5 / (5 + 35) = 0.125

So, the FPR is 0.125, or 12.5%.
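
The calculations above can be reproduced with a few lines of Python (plain arithmetic on the counts
given earlier; no library is needed):

TP, FP, TN, FN = 50, 5, 35, 10

accuracy    = (TP + TN) / (TP + TN + FP + FN)                # 0.85
precision   = TP / (TP + FP)                                 # ~0.909
recall      = TP / (TP + FN)                                 # ~0.833 (also the TPR)
f1          = 2 * precision * recall / (precision + recall)  # ~0.87
specificity = TN / (TN + FP)                                 # 0.875
fpr         = FP / (FP + TN)                                 # 0.125

print(accuracy, precision, recall, f1, specificity, fpr)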

ROC AUC
The ROC AUC (Receiver Operating Characteristic - Area Under the Curve) is a performance
metric for evaluating the performance of binary classification models. The ROC curve plots the True
Positive Rate (TPR, also known as recall) against the False Positive Rate (FPR) at various threshold
settings. The AUC (Area Under the Curve) summarizes the ROC curve in a single number, representing
the likelihood that the model will rank a randomly chosen positive instance higher than a randomly
chosen negative instance.

ROC Curve

The ROC curve is a graphical representation of the trade-offs between TPR and FPR across
different classification thresholds. An ideal model will have a ROC curve that passes close to the top-
left corner of the plot, where TPR is 1 and FPR is 0.

AUC

The AUC represents the overall ability of the model to discriminate between positive and
negative classes. The value of AUC ranges from 0 to 1:

AUC = 1: Perfect model.

AUC = 0.5: Model with no discrimination ability, equivalent to random guessing.

AUC < 0.5: Worse than random guessing, typically an indicator of a flawed model.
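
A short sketch of computing the ROC curve and AUC (assuming scikit-learn; the labels and scores
below are made-up illustrative values, not results from any model in these notes):

from sklearn.metrics import roc_auc_score, roc_curve

y_true  = [0, 0, 1, 1, 0, 1, 1, 0]                   # actual classes
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.7, 0.6]  # predicted probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # points along the ROC curve
print(roc_auc_score(y_true, y_score))              # area under that curve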
A BRIEF REVIEW OF PROBABILITY THEORY
Probability theory is a branch of mathematics concerned with the analysis of random
phenomena. It provides a framework for quantifying the uncertainty inherent in various processes
and events. Here’s a brief overview of the key concepts in probability theory:

Basic Concepts
 Experiment: An action or process that leads to one or more possible outcomes. For example,
rolling a die or flipping a coin.
 Sample Space (S): The set of all possible outcomes of an experiment. For example, for a die
roll, S = {1, 2, 3, 4, 5, 6}.


 Event (E): A subset of the sample space. An event consists of one or more outcomes. For
example, rolling an even number on a die (E = {2, 4, 6}).
 Probability (P): A measure of the likelihood that an event will occur. The probability of an
event E is denoted by P(E). It ranges from 0 (impossible event) to 1 (certain event).
Random Variables
A random variable is a function that assigns a real number to each outcome in the sample
space. There are two main types of random variables:

Discrete Random Variables: Take on a countable number of distinct values. For example, the number
of heads in 10 coin flips.

Continuous Random Variables: Take on an infinite number of possible values within a given range.
For example, the exact height of students in a class.

Probability Distributions
A probability distribution describes how the probabilities are distributed over the values of the
random variable.

Probability Mass Function (PMF): For discrete random variables, the PMF gives the probability that
the random variable takes a specific value.

Probability Density Function (PDF): For continuous random variables, the PDF describes the likelihood
of the random variable taking a value within a given range.

Cumulative Distribution Function (CDF): Gives the probability that the random variable is less than or
equal to a certain value.
TURNING DATA INTO PROBABILITIES
Turning data into probabilities is a crucial aspect of machine learning, particularly in probabilistic
models, classification, and prediction tasks. Here’s a guide on how to do it:

1. Understanding the Data


Before turning data into probabilities, it is important to understand the type of data you have:

Categorical Data: Discrete values representing categories.

Numerical Data: Continuous or discrete numerical values.

Text Data: Unstructured data in the form of text.

2. Data Preprocessing
Preprocessing the data involves cleaning, transforming, and organizing it to be suitable for analysis.
This can include:

Handling Missing Values: Filling in missing values or removing incomplete records.

Normalizing/Standardizing: Adjusting the scale of numerical data.

Encoding Categorical Data: Converting categorical data into numerical form using techniques like one-
hot encoding.

3. Estimating Probabilities
1. Frequency-Based Methods
For categorical data, probabilities can be estimated using frequency counts. For example, in a
dataset of emails, the probability of an email being spam can be estimated by the proportion of spam
emails.

2. Density Estimation
For continuous data, probability density functions (PDFs) can be estimated using techniques
like:

Histogram: Dividing the data into bins and estimating the density by the height of the histogram.

Kernel Density Estimation (KDE): Smoothing the data using a kernel function to estimate the PDF.
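
A minimal kernel density estimation sketch (assuming SciPy; the sample data is synthetic):

import numpy as np
from scipy.stats import gaussian_kde

# Synthetic continuous data drawn from a normal distribution
rng = np.random.default_rng(0)
samples = rng.normal(loc=5.0, scale=2.0, size=500)

kde = gaussian_kde(samples)       # smooth the samples with a Gaussian kernel
grid = np.linspace(0.0, 10.0, 5)
print(kde(grid))                  # estimated density at a few points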

3. Parametric Methods
Using parametric models involves assuming that the data follows a certain distribution (e.g., normal
distribution) and estimating the parameters of that distribution.

For example, if the data is assumed to follow a normal distribution, the probability density of a
data point x can be estimated using the normal distribution formula:

f(x) = (1 / √(2πσ²)) · exp(−(x − μ)² / (2σ²))

where μ is the mean and σ² is the variance estimated from the data.
4. Probabilistic Models
1. Naive Bayes
Naive Bayes is a simple yet effective probabilistic classifier based on Bayes' theorem, assuming
independence between the features:

P(Y | X) = P(X | Y) · P(Y) / P(X)

where

 Y is the class variable and


 X is the feature vector.

Despite the independence assumption, Naive Bayes often performs well in practice.
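
A minimal Naive Bayes sketch (assuming scikit-learn; GaussianNB, one of several Naive Bayes
variants it provides, is an illustrative choice):

from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
model = GaussianNB().fit(X, y)

# predict_proba returns the model's estimate of P(Y | X) for each class
print(model.predict_proba(X[:3]))
print(model.predict(X[:3]), y[:3])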

2. Logistic Regression
Logistic regression is used for binary classification and estimates the probability of the binary
outcome using the logistic function:

P(y = 1 | x) = 1 / (1 + e^−(β₀ + β₁x₁ + … + βₙxₙ))

where the β are the coefficients to be estimated.

3. Bayesian Networks
Bayesian networks represent a set of variables and their conditional dependencies using a directed
acyclic graph (DAG). They are used to model complex probabilistic relationships.

5. Training and Evaluation


Training: Fit the chosen probabilistic model to the training data. This involves estimating the
parameters that maximize the likelihood of the observed data.

Evaluation: Assess the performance of the model using metrics like log-likelihood, cross-entropy loss,
and accuracy. Use techniques like cross-validation to ensure the model generalizes well to unseen
data.

6. Making Predictions
Once the model is trained, it can be used to make probabilistic predictions. For example, given a new
data point, the model can estimate the probability of different outcomes (e.g., the probability that an
email is spam).
7. Practical Considerations
Model Assumptions: Ensure the assumptions of the chosen model are reasonable for your data.

Computational Efficiency: Some methods can be computationally expensive; choose an appropriate
method based on the size and complexity of the data.

Interpretability: Probabilistic models can provide interpretable outputs, such as the probability of a
class, which can be useful for decision-making.

By carefully preprocessing the data, choosing appropriate probabilistic models, and rigorously
training and evaluating these models, you can effectively turn data into meaningful probabilities in
machine learning tasks.

The Bias-Variance Trade-off


The bias-variance tradeoff is a fundamental concept in machine learning and statistical
modeling that describes the balance between two sources of error that affect the performance of
predictive models:

Bias: Bias refers to the error introduced by approximating a real-world problem, which may be
complex, by a simplified model. High bias means that the model makes strong assumptions about the
data, potentially oversimplifying it. This can lead to underfitting, where the model fails to capture
important patterns in the data.

Variance: Variance refers to the error introduced by the model's sensitivity to small fluctuations in the
training data. High variance means that the model is too complex and fits the training data too closely,
capturing noise as if it were a real pattern. This can lead to overfitting, where the model performs well
on training data but poorly on unseen data.
The tradeoff arises because decreasing bias typically increases variance, and vice versa. The
goal is to find a balance that minimizes the total error, which is the sum of bias and variance errors,
along with the irreducible error (error due to noise in the data that cannot be eliminated by any
model).

Illustration
High Bias, Low Variance: A simple model (e.g., linear regression with few features) that doesn't
capture the underlying complexity of the data.

Low Bias, High Variance: A complex model (e.g., a deep neural network with many parameters) that
captures noise and fluctuations in the training data.

Model Complexity and Error


Underfitting (High Bias): The model is too simple, leading to high training error and high test error.

Overfitting (High Variance): The model is too complex, leading to low training error but high test error.

Balancing Bias and Variance


To achieve a good tradeoff:

Model Selection: Choose a model with the appropriate level of complexity for the given data.

Regularization: Techniques like L1 (Lasso) and L2 (Ridge) regularization add constraints to the model
parameters, preventing overfitting.

Cross-Validation: Use cross-validation techniques to assess model performance and avoid overfitting
by evaluating the model on multiple subsets of the data.
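
For instance, L2 and L1 regularization are available in scikit-learn as Ridge and Lasso (a minimal
sketch on synthetic data; the alpha values are illustrative and control the regularization strength):

import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
# Only the first two features actually matter; the other eight are noise
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

ridge = Ridge(alpha=1.0).fit(X, y)  # L2: shrinks all coefficients towards zero
lasso = Lasso(alpha=0.1).fit(X, y)  # L1: drives irrelevant coefficients to exactly zero
print(ridge.coef_.round(2))
print(lasso.coef_.round(2))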

Bias Variance Tradeoff


If the algorithm is too simple (e.g., a hypothesis with a linear equation), it may be in a high-bias,
low-variance condition and thus be error-prone. If the algorithm fits too complex a hypothesis (e.g.,
one with a high-degree equation), it may be in a high-variance, low-bias condition, and new entries
will not be predicted well. Between these two conditions lies the bias-variance trade-off: an
algorithm cannot be more complex and less complex at the same time, so we look for the level of
complexity at which the total error is minimized.
FIND–S ALGORITHM
The Find-S algorithm is a simple machine learning algorithm used for learning a hypothesis from a set
of positive training examples. It is used in concept learning and is designed to find the most specific
hypothesis that fits the given positive training instances.

The Find-S algorithm is a basic concept learning algorithm in machine learning. It finds the most
specific hypothesis that fits all the positive examples; note that the algorithm considers only the
positive training examples. Find-S starts with the most specific hypothesis and generalizes this
hypothesis each time it fails to classify an observed positive training example. Hence, the Find-S
algorithm moves from the most specific hypothesis towards a more general hypothesis.

Important Representation :
 ? indicates that any value is acceptable for the attribute.
 A specific value (e.g., Cold) indicates that a single value is required for the attribute.
 ϕ indicates that no value is acceptable.
 The most general hypothesis is represented by: {?, ?, ?, ?, ?, ?}
 The most specific hypothesis is represented by: {ϕ, ϕ, ϕ, ϕ, ϕ, ϕ}

Steps Involved In Find-S :


Start with the most specific hypothesis.

1. h = {ϕ, ϕ, ϕ, ϕ, ϕ, ϕ}
2. Take the next example; if it is negative, no changes occur to the hypothesis.
3. If the example is positive and we find that our current hypothesis is too specific, then we update
the hypothesis to a more general one.
4. Keep repeating the above steps until all the training examples have been processed.
5. After we have processed all the training examples, we have the final hypothesis, which can be
used to classify new examples.
Example :

Consider a data set containing data about which particular seeds are poisonous.
First, we take the hypothesis to be the most specific hypothesis. Hence, our hypothesis would be :

h = {ϕ, ϕ, ϕ, ϕ}

Consider example 1 :

The data in example 1 is { GREEN, HARD, NO, WRINKLED }. We see that our initial hypothesis is too
specific and we have to generalize it for this example. Hence, the hypothesis becomes :

h = { GREEN, HARD, NO, WRINKLED }

Consider example 2 :

Here we see that this example has a negative outcome. Hence we neglect this example and our
hypothesis remains the same.

h = { GREEN, HARD, NO, WRINKLED }

Consider example 3 :

Here we see that this example has a negative outcome. Hence we neglect this example and our
hypothesis remains the same.

h = { GREEN, HARD, NO, WRINKLED }

Consider example 4 :

The data present in example 4 is { ORANGE, HARD, NO, WRINKLED }. We compare every single
attribute with the current hypothesis, and if any mismatch is found, we replace that particular
attribute with the general case ( " ? " ). After doing this, the hypothesis becomes :

h = { ?, HARD, NO, WRINKLED }

Consider example 5 :

The data present in example 5 is { GREEN, SOFT, YES, SMOOTH }. We compare every single attribute
with the current hypothesis, and if any mismatch is found, we replace that particular attribute with
the general case ( " ? " ). After doing this, the hypothesis becomes :

h = { ?, ?, ?, ? }

Since we have reached a point where all the attributes in our hypothesis have the general condition,
examples 6 and 7 would result in the same hypothesis with all general attributes.

h = { ?, ?, ?, ? }

Hence, for the given data the final hypothesis would be :

Final Hypothesis: h = { ?, ?, ?, ? }

Algorithm:

1. Initialize h to the most specific hypothesis in H.

2. For each positive training instance x:

       For each attribute constraint aᵢ in h:

           If the constraint aᵢ is satisfied by x,
           then do nothing;
           else replace aᵢ in h by the next more general constraint that is satisfied by x.

3. Output hypothesis h.
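
A runnable sketch of Find-S applied to the positive examples worked through above (the full seed
table is not reproduced in these notes, so only the positive rows quoted above are used, in the
attribute order as quoted):

def find_s(positive_examples):
    # Start with the most specific hypothesis: one phi per attribute
    h = ['phi'] * len(positive_examples[0])
    for x in positive_examples:
        for i, value in enumerate(x):
            if h[i] == 'phi':     # first positive example: adopt its value
                h[i] = value
            elif h[i] != value:   # mismatch: generalize this attribute to '?'
                h[i] = '?'
    return h

positives = [
    ['GREEN',  'HARD', 'NO',  'WRINKLED'],  # example 1
    ['ORANGE', 'HARD', 'NO',  'WRINKLED'],  # example 4
    ['GREEN',  'SOFT', 'YES', 'SMOOTH'],    # example 5
]
print(find_s(positives))  # ['?', '?', '?', '?'], matching the final hypothesis above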

CANDIDATE ELIMINATION ALGORITHM


The Candidate Elimination Algorithm is a machine learning algorithm used for concept
learning, which is the task of inferring a general concept from specific examples. It is a form of
supervised learning where the goal is to find a hypothesis that correctly classifies training examples
into a target concept.

A version space is a concept in machine learning that represents the set of all hypotheses
consistent with the observed training examples. Specifically, it is the subset of the hypothesis space
that remains plausible after observing the training data. The version space is bounded by two sets of
hypotheses.

Specific Boundary (S): This is the set of the most specific hypotheses that are consistent with all
positive examples. A hypothesis is in the specific boundary if it describes as narrowly as possible all
the observed positive examples without including any negative examples.

General Boundary (G): This is the set of the most general hypotheses that are consistent with all
positive examples and exclude all negative examples. A hypothesis is in the general boundary if it
describes as broadly as possible the positive examples while excluding the negative ones.

The candidate elimination algorithm incrementally builds the version space given a hypothesis
space H and a set E of examples. The examples are added one by one; each example possibly shrinks
the version space by removing the hypotheses that are inconsistent with the example. The candidate
elimination algorithm does this by updating the general and specific boundary for each new example.

 We can consider this as an extended form of the Find-S algorithm.
 It considers both positive and negative examples.
 Positive examples are used as in the Find-S algorithm: they generalize from the specific
boundary.
 Negative examples are used to specialize the general boundary.

Terms Used:

 Concept learning: Concept learning is basically the learning task of the machine (Learn by
Train data)
 General Hypothesis: Does not specify features for the machine to learn.
 G = {?, ?, ?, ? …}: one ? per attribute.
 Specific Hypothesis: Specifies features for the machine to learn (specific features).
 S = {ϕ, ϕ, ϕ, …}: the number of ϕ depends on the number of attributes.
 Version Space: It is intermediate between the general hypothesis and the specific hypothesis. It
contains not just one hypothesis but the set of all possible hypotheses consistent with the
training data set.
Algorithm:
Step 1: Load the data set.

Step 2: Initialize the General Hypothesis and the Specific Hypothesis.

Step 3: For each training example:

Step 4: If the example is positive:

if attribute_value == hypothesis_value:

    Do nothing

else:

    replace the attribute value in the specific hypothesis with '?' (basically generalizing it)

Step 5: If the example is negative:

Make the general hypothesis more specific (specialize it so that it excludes the example).
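
A simplified, runnable sketch of these steps (plain Python; it keeps a single specific hypothesis,
assumes consistent training data, and the two-attribute dataset at the bottom is made up purely for
illustration):

def covers(h, x):
    # A conjunctive hypothesis covers x if every constraint is '?' or matches x
    return all(hv == '?' or hv == xv for hv, xv in zip(h, x))

def candidate_elimination(examples):
    n = len(examples[0][0])
    S = ['phi'] * n    # specific boundary (kept as a single hypothesis here)
    G = [['?'] * n]    # general boundary
    for x, positive in examples:
        if positive:
            G = [g for g in G if covers(g, x)]  # drop general hypotheses that miss x
            for i in range(n):                  # minimally generalize S (as in Find-S)
                if S[i] == 'phi':
                    S[i] = x[i]
                elif S[i] != x[i]:
                    S[i] = '?'
        else:
            new_G = []
            for g in G:
                if not covers(g, x):
                    new_G.append(g)             # already excludes the negative example
                    continue
                for i in range(n):              # minimal specializations consistent with S
                    if g[i] == '?' and S[i] not in ('phi', '?') and S[i] != x[i]:
                        spec = list(g)
                        spec[i] = S[i]
                        new_G.append(spec)
            G = new_G
    return S, G

examples = [
    (['RED',  'BIG'],   True),
    (['RED',  'SMALL'], True),
    (['BLUE', 'BIG'],   False),
]
print(candidate_elimination(examples))  # (['RED', '?'], [['RED', '?']]): S and G converge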

The Candidate Elimination Algorithm (CEA) is an improvement over the Find-S algorithm for
classification tasks. While CEA shares some similarities with Find-S, it also has some essential
differences that offer advantages and disadvantages. Here are some advantages and disadvantages of
CEA in comparison with Find-S:

Advantages of CEA over Find-S:


 Improved accuracy: CEA considers both positive and negative examples to generate the
hypothesis, which can result in higher accuracy when dealing with noisy or incomplete data.
 Flexibility: CEA can handle more complex classification tasks, such as those with multiple
classes or non-linear decision boundaries.
 More efficient: CEA reduces the number of hypotheses by generating a set of general
hypotheses and then eliminating them one by one. This can result in faster processing and
improved efficiency.
 Better handling of continuous attributes: CEA can handle continuous attributes by creating
boundaries for each attribute, which makes it more suitable for a wider range of datasets.

Disadvantages of CEA in comparison with Find-S:

 More complex: CEA is a more complex algorithm than Find-S, which may make it more difficult
for beginners or those without a strong background in machine learning to use and
understand.
 Higher memory requirements: CEA requires more memory to store the set of hypotheses and
boundaries, which may make it less suitable for memory-constrained environments.
 Slower processing for large datasets: CEA may become slower for larger datasets due to the
increased number of hypotheses generated.
 Higher potential for overfitting: The increased complexity of CEA may make it more prone to
overfitting on the training data, especially if the dataset is small or has a high degree of noise.

Statistical concepts
Mean (Average): The sum of all values divided by the number of values. It represents the central
tendency of the data.

Median: The middle value in a data set when the values are arranged in ascending order. It is a
measure of central tendency that is less affected by outliers.

Mode: The value that appears most frequently in a data set. A data set can have one mode, more than
one mode, or no mode.

Variance: A measure of how much the values in a data set deviate from the mean. It is calculated as
the average of the squared differences from the mean.

Standard Deviation: The square root of the variance. It provides a measure of the spread or dispersion
of a set of data.

Range: The difference between the maximum and minimum values in a data set.

Skewness: A measure of the asymmetry of the distribution of values in a data set. Positive skewness
indicates a distribution with a longer tail on the right, while negative skewness indicates a longer tail
on the left.

Kurtosis: A measure of the "tailedness" of the distribution. High kurtosis means more data is in the
tails, while low kurtosis means less data is in the tails.
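
These quantities can be computed directly (a sketch assuming NumPy and SciPy; the data values are
illustrative):

import numpy as np
from scipy import stats

data = np.array([2, 4, 4, 4, 5, 5, 7, 9])

print(np.mean(data))          # mean: 5.0
print(np.median(data))        # median: 4.5
print(stats.mode(data).mode)  # mode: 4
print(np.var(data))           # (population) variance: 4.0
print(np.std(data))           # standard deviation: 2.0
print(np.ptp(data))           # range (max - min): 7
print(stats.skew(data))       # skewness
print(stats.kurtosis(data))   # excess kurtosis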
