
OCS351 – ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING FUNDAMENTALS

UNIT III LEARNING


Machine Learning: Definitions – Classification - Regression - approaches of machine learning models - Types
of learning - Probability - Basics - Linear Algebra – Hypothesis space and inductive bias, Evaluation. Training
and test sets, cross validation, Concept of over fitting, under fitting, Bias and Variance - Regression: Linear
Regression – Logistic Regression

I. MACHINE LEARNING: DEFINITIONS


What is machine learning? Need – History and Definitions - Applications
→ Machine Learning is a branch of Artificial Intelligence that allows machines to learn and improve
from experience automatically.
→ It is defined as the field of study that gives computers the capability to learn without being explicitly
programmed, which makes it quite different from traditional programming.

→ AI (Artificial Intelligence) is a machine’s ability to perform cognitive functions as humans do, such
as perceiving, learning, reasoning, and solving problems. The benchmark for AI is human-level
performance in terms of reasoning, speech, and vision.
→ Machine learning is important because it gives enterprises a view of trends in customer behavior
and business operational patterns, as well as supports the development of new products. Many of
today's leading companies, such as Facebook, Google and Uber, make machine learning a central
part of their operations.
The lifecycle of a Machine Learning program is straightforward and can be summarized in the following steps:
1. Define a question
2. Collect data
3. Visualize data
4. Train algorithm
5. Test the Algorithm
6. Collect feedback

7. Refine the algorithm


8. Loop steps 4−7 until the results are satisfactory (a minimal sketch of this workflow follows)
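To make the workflow concrete, here is a minimal sketch of steps 2–8 in Python with scikit-learn. The synthetic dataset, the logistic regression model, and the 25% test split are illustrative assumptions, not part of the original notes.

```python
# Minimal sketch of the ML program lifecycle: collect data, train, test, refine.
# The dataset, model, and split ratio are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Step 2: collect (here, generate) data
X, y = make_classification(n_samples=500, n_features=4, random_state=0)

# Steps 4-5: train the algorithm, then test it on held-out data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model = LogisticRegression().fit(X_train, y_train)
score = accuracy_score(y_test, model.predict(X_test))

# Steps 6-8: collect feedback and refine/loop until the results are satisfactory
print(f"Test accuracy: {score:.2f}")
```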
TYPES OF MACHINE LEARNING
Types of machine learning?
→ Classical machine learning is often categorized by how an algorithm learns to become more accurate
in its predictions. There are four basic approaches: supervised learning, unsupervised
learning, semi−supervised learning and reinforcement learning.
→ The type of algorithm data scientists choose to use depends on what type of data they want to predict.
Supervised learning
→ In this type of machine learning, data scientists supply algorithms with labeled training
data and define the variables they want the algorithm to assess for correlations.
→ Both the input and the output of the algorithm are specified.
Unsupervised learning
→ This type of machine learning involves algorithms that train on unlabeled data.
→ The algorithm scans through data sets looking for any meaningful connection. Neither the data
that the algorithms train on nor the predictions or recommendations they output are
predetermined.
Semi-supervised learning
→ This approach to machine learning involves a mix of the two preceding types. Data
scientists may feed an algorithm mostly labeled training data, but the model is free to
explore the data on its own and develop its own understanding of the data set.
Reinforcement learning
→ Data scientists typically use reinforcement learning to teach a machine to complete a
multi−step process for which there are clearly defined rules.
→ Data scientists program an algorithm to complete a task and give it positive or negative
cues as it works out how to complete a task. But for the most part, the algorithm decides
on its own what steps to take along the way.
How does supervised machine learning work? Supervised machine learning requires the data scientist to
train the algorithm with both labeled inputs and desired outputs. Supervised learning algorithms are good for
the following tasks:
Binary classification: Dividing data into two categories.
Multi-class classification: Choosing between more than two types of answers.
Regression modeling: Predicting continuous values.
Ensembling: Combining the predictions of multiple machine learning models to produce an
accurate prediction.

II. CLASSIFICATION
Classification
→ As the name suggests, Classification is the task of “classifying things” into sub−categories. But, by a
machine! If that doesn’t sound like much, imagine your computer being able to differentiate between
you and a stranger, between a potato and a tomato, or between an A grade and an F. Now it sounds
interesting.
→ In Machine Learning and Statistics, Classification is the problem of identifying to which of a set of
categories (subpopulations) a new observation belongs, on the basis of a training set of data containing
observations whose category membership is known.
Types of Classification
Classification is of two types:
Binary Classification: When we have to categorize given data into 2 distinct classes. Example – On
the basis of given health conditions of a person, we have to determine whether the person has a certain
disease or not.
Multiclass Classification: The number of classes is more than 2. For example – On the basis of data
about different species of flowers, we have to determine which species our observation belongs to.

Fig: Binary and Multiclass Classification. Here x1 and x2 are the variables upon which the class is predicted.

How does classification work?


→ Suppose we have to predict whether a given patient has a certain disease or not, on the basis of 3
variables, called features.
→ This means there are two possible outcomes:
• The patient has the said disease. Basically, a result labeled “Yes” or “True”.
• The patient is disease−free. A result labeled “No” or “False”.
This is a binary classification problem. We have a set of observations called the training data set, which
comprises sample data with actual classification results. We train a model, called Classifier on this data set,
and use that model to predict whether a certain patient will have the disease or not.

The outcome thus depends upon:


1. How well these features are able to “map” to the outcome.
2. The quality of our data set. By quality, I refer to its statistical and mathematical properties.
3. How well our Classifier generalizes this relationship between the features and the outcome.
4. The values of the x1 and x2.
Following is the generalized block diagram of the classification task.
Generalized Classification Block Diagram:
1. X: pre−classified data, in the form of an N*M matrix. N is the no. of observations and M is the number
of features
2. y: An N−dimensional vector of the known class labels for each of the N observations.
3. Feature Extraction: Extracting valuable information from input X using a series of transforms.
4. ML Model: The “Classifier” we’ll train.
5. y’: Labels predicted by the Classifier.
6. Quality Metric: Metric used for measuring the performance of the model.
7. ML Algorithm: The algorithm that is used to update the weights w’, which updates the model and “learns”
iteratively (a code sketch of this flow follows).
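The block diagram above can be sketched in code as follows. The scaler, the SGD classifier, and the synthetic data are illustrative stand-ins for X, the feature-extraction step, the ML model, and the quality metric; they are not prescribed by the notes.

```python
# Sketch of the flow X -> feature extraction -> classifier -> y' -> quality metric.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score

rng = np.random.RandomState(0)
X = rng.rand(100, 3)                                   # X: N=100 observations, M=3 features
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)              # y: N-dimensional vector of known labels

features = StandardScaler().fit_transform(X)           # feature extraction (a series of transforms)
clf = SGDClassifier(random_state=0).fit(features, y)   # ML model trained iteratively by the ML algorithm
y_pred = clf.predict(features)                         # y': labels predicted by the classifier
print(accuracy_score(y, y_pred))                       # quality metric
```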

Types of Classifiers (algorithms)


There are various types of classifiers. Some of them are:
Linear Classifiers: Logistic Regression
Tree−Based Classifiers: Decision Tree Classifier
Support Vector Machines
Artificial Neural Networks
Bayesian Regression
Gaussian Naive Bayes Classifiers
Stochastic Gradient Descent (SGD) Classifier
Ensemble Methods: Random Forests, AdaBoost, Bagging Classifier, Voting Classifier,
ExtraTrees Classifier

Learners in Classification Problems:


In the classification problems, there are two types of learners:
1. Lazy Learners: A lazy learner first stores the training dataset and waits until it receives the test dataset.
In the lazy learner case, classification is done on the basis of the most related data stored in the training
dataset. It takes less time in training but more time for predictions. Example: K−NN algorithm,
Case−based reasoning
2. Eager Learners: Eager Learners develop a classification model based on a training dataset before
receiving a test dataset. Opposite to Lazy learners, Eager Learner takes more time in learning, and less
time in prediction. Example: Decision Trees, Naïve Bayes, ANN.
Evaluating a Classification model:
Once our model is completed, it is necessary to evaluate its performance, whether it is a Classification or
Regression model. For evaluating a Classification model, we have the following ways:
1. Log Loss or Cross-Entropy Loss:
o It is used for evaluating the performance of a classifier whose output is a probability value
between 0 and 1.
o For a good binary Classification model, the value of log loss should be near to 0.
o The value of log loss increases if the predicted value deviates from the actual value.
o The lower the log loss, the higher the accuracy of the model.
o For binary classification, cross−entropy can be calculated as:
Log Loss = −(y log(p) + (1 − y) log(1 − p))
Where y = actual output, p = predicted probability.
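As a quick check of the formula, the sketch below computes the binary cross-entropy by hand for a few hypothetical predictions and compares it with scikit-learn's log_loss; the labels and probabilities are made up for illustration.

```python
# Hand-computed binary cross-entropy, compared with scikit-learn's log_loss.
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([1, 0, 1, 1])            # actual outputs y
p_pred = np.array([0.9, 0.1, 0.8, 0.4])    # predicted probabilities p

manual = -np.mean(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))
print(manual)                    # ~0.34; the closer to 0, the better the classifier
print(log_loss(y_true, p_pred))  # same value from scikit-learn
```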
2. Confusion Matrix:
o The confusion matrix provides us with a matrix/table as output and describes the performance of the
model. It is also known as the error matrix.
o The matrix summarizes the prediction results, giving the total numbers of correct predictions
and incorrect predictions. The matrix looks like the table below:

                     Actual Positive        Actual Negative
Predicted Positive   True Positive (TP)     False Positive (FP)
Predicted Negative   False Negative (FN)    True Negative (TN)
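The table above can be produced from predictions with scikit-learn's confusion_matrix, as in the hypothetical example below; note that scikit-learn reports rows as actual classes and columns as predicted classes, so the cells are unpacked explicitly.

```python
# Building the confusion matrix entries (TP, FP, FN, TN) from example predictions.
from sklearn.metrics import confusion_matrix

y_actual    = [1, 0, 1, 1, 0, 1, 0, 0]
y_predicted = [1, 0, 1, 0, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_actual, y_predicted).ravel()
print(f"TP={tp}, FP={fp}, FN={fn}, TN={tn}")   # TP=3, FP=1, FN=1, TN=3
```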

3. AUC-ROC curve:
o ROC curve stands for Receiver Operating Characteristics Curve and AUC stands for
Area Under the Curve.

o It is a graph that shows the performance of the classification model at different thresholds.
o To visualize the performance of a multi−class classification model, we also use the AUC−ROC
curve.
o The ROC curve is plotted with TPR and FPR, where the TPR (True Positive Rate) is on the Y−axis and
the FPR (False Positive Rate) is on the X−axis.
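A small sketch of how the curve is obtained: each threshold gives one (FPR, TPR) point, and the AUC summarizes the whole curve. The labels and scores below are invented for illustration.

```python
# TPR/FPR at different thresholds and the area under the resulting ROC curve.
from sklearn.metrics import roc_curve, roc_auc_score

y_true = [0, 0, 1, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]   # predicted probabilities of the positive class

fpr, tpr, thresholds = roc_curve(y_true, scores)
print(list(zip(thresholds, fpr, tpr)))      # one (threshold, FPR, TPR) point per threshold
print(roc_auc_score(y_true, scores))        # AUC: area under that curve
```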
Use cases of Classification Algorithms
Classification algorithms can be used in different places. Below are some popular use cases of
Classification Algorithms:
→ Email Spam Detection
→ Speech Recognition
→ Identifications of Cancer tumor cells.
→ Drugs Classification
→ Biometric Identification, etc.

III. REGRESSION
→ Regression in machine learning refers to a supervised learning technique where the goal is to
predict a continuous numerical value based on one or more independent features. It finds relationships
between variables so that predictions can be made. We have two types of variables present in
regression:
Dependent Variable (Target): The variable we are trying to predict, e.g., house price.
Independent Variables (Features): The input variables that influence the prediction, e.g.,
locality, number of rooms.
A regression analysis problem applies when the output variable is a real or continuous value such as “salary” or
“weight”. Many different regression models can be used, but the simplest among them is linear regression.
Types of Regression
Regression can be classified into different types based on the number of predictor variables and the
nature of the relationship between variables:
1. Simple Linear Regression
o Linear regression is one of the simplest and most widely used statistical models. This
assumes that there is a linear relationship between the independent and dependent variables.
o This means that the change in the dependent variable is proportional to the change in the
independent variables. For example, predicting the price of a house based on its size.
2. Multiple Linear Regression
o Multiple linear regression extends simple linear regression by using multiple independent
variables to predict the target variable. For example, predicting the price of a house based on
multiple features such as size, location, number of rooms, etc.

3. Polynomial Regression
o Polynomial regression is used to model non-linear relationships between the dependent
variable and the independent variables.
o It adds polynomial terms to the linear regression model to capture more complex
relationships. For example, when we want to predict a non-linear trend like population growth
over time, we use polynomial regression.
4. Ridge & Lasso Regression
o Ridge & lasso regression are regularized versions of linear regression that help avoid
overfitting by penalizing large coefficients. When there’s a risk of overfitting due to too many
features, we use these types of regression algorithms.
5. Support Vector Regression (SVR)
o SVR is a type of regression algorithm that is based on the Support Vector Machine
(SVM) algorithm.
o SVM is a type of algorithm that is used for classification tasks but it can also be used for
regression tasks.
o SVR works by finding a function (hyperplane) that fits the data within a margin of tolerance,
penalizing predictions that fall outside that margin.
6. Decision Tree Regression
o A decision tree uses a tree-like structure to make decisions, where each branch of the tree
represents a decision and the leaves represent outcomes.
o For example, we use decision tree regression to predict customer behaviour based on features
like age, income, etc.
7. Random Forest Regression
o Random Forest is an ensemble method that builds multiple decision trees, where each tree is
trained on a different subset of the training data. The final prediction is made by averaging the
predictions of all of the trees. For example, we can model customer churn or sales data using this.
Regression Evaluation Metrics:
Evaluation in machine learning measures the performance of a model. Here are some popular
evaluation metrics for regression:
Mean Absolute Error (MAE): The average absolute difference between the predicted and actual
values of the target variable.
Mean Squared Error (MSE): The average squared difference between the predicted and actual
values of the target variable.
Root Mean Squared Error (RMSE): Square root of the mean squared error.
Huber Loss: A hybrid loss function that transitions from MAE to MSE for larger errors, providing
balance between robustness and MSE’s sensitivity to outliers.

R2 Score: Measures how well the model explains the variance of the target; higher values, up to 1, indicate a better fit. A short computation of these metrics follows.
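The sketch below computes these metrics for a handful of hypothetical predictions, using scikit-learn where a helper exists and NumPy for the RMSE.

```python
# MAE, MSE, RMSE, and R^2 for a few hypothetical predictions.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 3.0, 8.0])

mae = mean_absolute_error(y_true, y_pred)    # average absolute difference
mse = mean_squared_error(y_true, y_pred)     # average squared difference
rmse = np.sqrt(mse)                          # square root of the MSE
r2 = r2_score(y_true, y_pred)                # closer to 1 means a better fit
print(mae, mse, rmse, r2)
```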


Applications of Regression
→ Predicting prices: Used to predict the price of a house based on its size, location and other features.
→ Forecasting trends: Model to forecast the sales of a product based on historical sales data.
→ Identifying risk factors: Used to identify risk factors for heart patients based on patient medical data.
→ Making decisions: It could be used to recommend which stock to buy based on market data.
Advantages of Regression
→ Easy to understand and interpret.
→ Robust to outliers.
→ Can handle both linear and non-linear relationships (with the appropriate regression variant).
Disadvantages of Regression
→ Assumes linearity.
→ Sensitive to situations where two or more independent variables are highly correlated with each other,
i.e., multicollinearity.
→ May not be suitable for highly complex relationships.

IV. APPROACHES OF MACHINE LEARNING MODELS


→ In today’s world, machine learning has become one of the most popular and exciting fields of study, giving
machines the ability to learn and become more accurate at predicting outcomes for unseen data,
i.e., data not seen in advance.
→ The ideas in machine learning overlap with and draw from Artificial Intelligence and many other
related technologies. Today, machine learning has evolved from pattern recognition and the concept
that computers can learn to perform specific tasks without being explicitly programmed.
→ We can use the Machine Learning algorithms (e.g, Logistic Regression, Naive Bayes, etc) to
• Recognize spoken words,
• Mine data, and
• Build applications that learn from data, etc.
→ The accuracy of these algorithms improves over time.
→ Machine learning models can be classified into two types:
Discriminative and Generative models.
→ In simple words, a discriminative model makes predictions on the unseen data based on conditional
probability and can be used either for classification or regression problem statements.
→ On the contrary, a generative model focuses on the distribution of a dataset to return a probability for a
given example.

As humans, we can adopt either of these two different approaches to machine learning models while learning an
artificial language. These two models have not previously been explored directly in human learning. However, they
relate to known effects of causal direction, classification vs. inference learning, and observational vs.
feedback learning.

Problem Formulation
Suppose we are working on a classification problem where our task is to decide whether an email is
spam or not based on the words present in that email. To solve this problem, we have a joint
model over
• Labels: Y=y, and
• Features: X = {x1, x2, …xn}
Therefore, the joint distribution of the model can be represented as
P(Y, X) = P(y, x1, x2, …, xn)
Now, our goal is to estimate the probability of spam email i.e, P(Y=1|X). Both generative and discriminative
models can solve this problem but in different ways.
Let’s see why and how they are different!
The approach of Generative Models
In the case of generative models, to find the conditional probability P(Y|X), they estimate the prior
probability P(Y) and the likelihood P(X|Y) with the help of the training data and use Bayes'
theorem to calculate the posterior probability:
P(Y|X) = P(X|Y) P(Y) / P(X)

The approach of Discriminative Models


In the case of discriminative models, to find the probability, they directly assume some functional
form for P(Y|X) and then estimate the parameters of P(Y|X) with the help of the training data.
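The two approaches can be contrasted on the same data. In the sketch below, Gaussian Naive Bayes stands in for a generative model (it estimates P(Y) and P(X|Y) and applies Bayes' theorem), while logistic regression stands in for a discriminative model (it fits P(Y|X) directly); the synthetic dataset is an illustrative assumption.

```python
# Same data, two approaches: generative (Gaussian Naive Bayes) vs discriminative (logistic regression).
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=4, random_state=1)

generative = GaussianNB().fit(X, y)              # estimates class priors P(Y) and likelihoods P(X|Y)
discriminative = LogisticRegression().fit(X, y)  # estimates P(Y|X) directly

print(generative.predict_proba(X[:1]))           # posterior obtained via Bayes' theorem
print(discriminative.predict_proba(X[:1]))       # posterior modeled directly
```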
What are Discriminative Models?
→ The discriminative model refers to a class of models used in Statistical Classification, mainly used
for supervised machine learning. These types of models are also known as conditional models
since they learn the boundaries between classes or labels in a dataset.
→ Discriminative models (just as in the literal meaning) separate classes instead of modeling the
conditional probability and don’t make any assumptions about the data points. But these models
are not capable of generating new data points. Therefore, the ultimate objective of discriminative
models is to separate one class from another.
→ If we have some outliers present in the dataset, then discriminative models work better compared
to generative models, i.e., discriminative models are more robust to outliers. However, one
major drawback of these models is the misclassification problem, i.e., wrongly classifying a data
point.

Training discriminative classifiers involves estimating a function f: X -> Y, or a probability P(Y|X)


• Assume some functional form for the probability such as P(Y|X)
• With the help of training data, we estimate the parameters of P(Y|X)
Some Examples of Discriminative Models
• Logistic regression
• Support Vector Machines (SVMs)
• Traditional neural networks
• Nearest neighbor
• Conditional Random Fields (CRFs)
• Decision Trees and Random Forest
What are Generative Models?
Generative models are considered as a class of statistical models that can generate new data
instances. These models are used in unsupervised machine learning as a means to perform tasks such as
• Probability and Likelihood estimation,
• Modeling data points,
• To describe the phenomenon in data,
• To distinguish between classes based on these probabilities.
Since these types of models often rely on Bayes' theorem to find the joint probability,
generative models can tackle more complex tasks than analogous discriminative models.
So, Generative models focus on the distribution of individual classes in a dataset and the learning
algorithms tend to model the underlying patterns or distribution of the data points. These models use the
concept of joint probability and create the instances where a given feature (x) or input and the desired output
or label (y) exist at the same time.
These models use probability estimates and likelihood to model data points and differentiate
between different class labels present in a dataset. Unlike discriminative models, these models are also
capable of generating new data points.

Mathematical things involved in Generative Models


Training generative classifiers involves estimating a function f: X -> Y, or a probability P(Y|X):
• Assume some functional form for the probabilities such as P(Y), P(X|Y)
• With the help of training data, we estimate the parameters of P(X|Y), P(Y)
• Use the Bayes theorem to calculate the posterior probability P(Y |X)
Some Examples of Generative Models
• Naïve Bayes
• Bayesian networks
• Markov random fields
• Hidden Markov Models (HMMs)
• Latent Dirichlet Allocation (LDA)
• Generative Adversarial Networks (GANs)
• Autoregressive Model
Difference between Discriminative and Generative Models
Let’s see some of the differences between Discriminative and Generative Models.
Core Idea
Discriminative models draw boundaries in the data space, while generative models try to model how
data is placed throughout the space. A generative model focuses on explaining how the data was generated,
while a discriminative model focuses on predicting the labels of the data.
Mathematical Intuition
In mathematical terms, a discriminative model is trained by learning parameters that maximize the
conditional probability P(Y|X), while on the other hand, a generative model
learns parameters by maximizing the joint probability of P(X, Y).


Applications
Discriminative models recognize existing data, i.e., discriminative modeling identifies tags, sorts
data and can be used to classify data, while generative modeling produces new data.
Since these models use different approaches to machine learning, both are suited to specific
tasks, i.e., generative models are useful for unsupervised learning tasks while discriminative models are
useful for supervised learning tasks.
Outliers
Outliers have a greater impact on generative models than on discriminative models.
Computational Cost
Discriminative models are computationally cheap as compared to generative models.
Comparison between Discriminative and Generative Models
Let’s see some of the comparisons based on the following criteria between Discriminative and Generative
Models:
• Performance
• Missing Data
• Accuracy Score
• Applications
Based on Performance
Generative models need less training data than discriminative models, since generative
models are more biased: they make stronger assumptions, i.e., the assumption of conditional independence.
Based on Missing Data
In general, if we have missing data in our dataset, then Generative models can work with these
missing data, while on the contrary discriminative models can't. This is because, in generative models, we
can still estimate the posterior by marginalizing over the unseen variables. However, for discriminative
models, we usually require all the features X to be observed.
Based on Accuracy Score: If the assumption of conditional independence is violated, then generative
models are less accurate than discriminative models.
Based on Applications: Discriminative models are called “discriminative” since they are useful for
discriminating Y’s label i.e, target outcome, so they can only solve classification problems while Generative
models have more applications besides classification such as,
• Samplings,
• Bayes learning,
• MAP inference, etc.

V. TYPES OF LEARNING
The main types of learning in machine learning are categorized based on how the model learns from data and
the nature of the data itself. These include:
• Supervised Learning:
o Description: The model learns from labeled data, where each input example is paired with its
corresponding correct output. The goal is to learn a mapping from inputs to outputs so that the
model can predict outputs for new, unseen inputs.
o Examples: Classification (predicting categories, e.g., spam detection) and Regression
(predicting continuous values, e.g., house price prediction).
• Unsupervised Learning:
o Description: The model learns from unlabeled data, aiming to discover hidden patterns,
structures, or relationships within the data without explicit guidance.
o Examples: Clustering (grouping similar data points, e.g., customer segmentation) and
Dimensionality Reduction (reducing the number of features while retaining important
information).
• Reinforcement Learning:
o Description: An agent learns to make decisions by interacting with an environment. It receives
rewards for desirable actions and penalties for undesirable ones, aiming to maximize
cumulative reward over time.
o Examples: Game playing (e.g., AlphaGo) and Robotics (teaching robots to perform tasks).
• Semi-supervised Learning:
o Description: This approach combines aspects of both supervised and unsupervised learning. It
utilizes a small amount of labeled data along with a larger amount of unlabeled data to train a
model.
o Examples: Text classification with limited labeled documents, image recognition.
• Self-supervised Learning:
o Description: A subset of unsupervised learning where the model generates its own labels from
the input data itself, effectively creating a supervised learning task from unlabeled data.
o Examples: Pre-training large language models (like BERT or GPT) by predicting masked
words or next sentences.
• Deep Learning:
• Concept:
A subfield of machine learning that utilizes artificial neural networks with multiple layers (deep neural
networks) to learn complex patterns from large datasets. Deep learning models can be applied to
supervised, unsupervised, and reinforcement learning tasks.

• Examples:
o Convolutional Neural Networks (CNNs): Primarily used for image and video analysis.
o Recurrent Neural Networks (RNNs) and Transformers: Primarily used for sequential data
like natural language processing (NLP) and time series.

VI. PROBABILITY - BASICS


Probability in machine learning:
→ Probability is the bedrock of ML; it tells how likely an event is to occur. The value of Probability
always lies between 0 and 1. It is the core concept as well as a primary prerequisite to understanding
ML models and their applications.
→ Probability can be calculated as the number of times the event occurs divided by the total number
of possible outcomes. Let's suppose we toss a coin; then the probability of getting a head as the
outcome can be calculated with the formula below:

P(H) = Number of ways a head can occur / total number of possible outcomes

P(H) = 1/2
P(H) = 0.5
Where,
P(H) = Probability of getting a head as the outcome while tossing a coin.
Types of Probability
For better understanding the Probability, it can be categorized further in different types as follows:
Empirical Probability: Empirical Probability can be calculated as the number of times the event
occurs divided by the total number of incidents observed.
Theoretical Probability: Theoretical Probability can be calculated as the number of ways the
particular event can occur divided by the total number of possible outcomes.
Joint Probability: It gives the probability of two random events occurring simultaneously.
P(A ∩ B) = P(A) · P(B) (when A and B are independent)
Where,
P(A ∩ B) = Probability of occurring events A and B both.
P (A) = Probability of event A
P (B) = Probability of event B
Conditional Probability: It is given by the Probability of event A given that event B occurred.
The Probability of an event A conditioned on an event B is denoted and defined as;
P(A|B) = P(A∩B)/P(B)
Similarly, P(B|A) = P(A ∩ B) / P(A). We can write the joint probability of A and B as P(A ∩ B) =
P(A) · P(B|A), which means: "The chance of both things happening is the chance that the first one happens, and
then the chance of the second one given that the first has happened."
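A worked instance of this chain rule, using the "drawing a second ace" situation mentioned in the next subsection; the numbers follow directly from a standard 52-card deck.

```python
# Chain rule P(A ∩ B) = P(A) · P(B|A): probability of drawing two aces in a row.
p_first_ace = 4 / 52            # P(A): 4 aces among 52 cards
p_second_given_first = 3 / 51   # P(B|A): one ace and one card fewer remain
p_both_aces = p_first_ace * p_second_given_first
print(p_both_aces)              # ≈ 0.0045, i.e. about 1 in 221
```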
We have a basic understanding of Probability required to learn Machine Learning.

Now, we will discuss the basic introduction of Statistics for ML.


Statistics in Machine Learning:
Statistics is also considered a foundational base of machine learning, which deals with finding
answers to the questions that we have about data.
Statistics can be categorized into 2 major parts. These are as follows:
→ Descriptive Statistics
→ Inferential Statistics
Use of Statistics in ML:
Statistics methods are used to understand the training data as well as interpret the results of testing
different machine learning models. Further, Statistics can be used to make better−informed business and
investing decisions.
Conditional probability and Bayesian theorem:
Conditional probabilities arise naturally in the investigation of experiments where an outcome of a trial
may affect the outcomes of the subsequent trials. We try to calculate the probability of the second event (event
B) given that the first event (event A) has already happened. If the probability of the event changes when
we take the first event into consideration, we can safely say that the probability of event B is dependent on the
occurrence of event A.
Let’s think of cases where this happens:
→ Drawing a second ace from a deck given we got the first ace
→ Finding the probability of having a disease given you were tested positive
→ Finding the probability of liking Harry Potter given we know the person likes fiction
And so on….
Here we can define 2 events:
→ Event A is the event whose probability we're trying to calculate.
→ Event B is the condition that we know, or the event that has already happened.
We can then write the conditional probability as the probability of the occurrence of event A given that
B has already happened.

Bayes Theorem
→ Bayesian decision theory refers to the statistical approach based on trade-off quantification among
various classification decisions based on the concept of Probability(Bayes Theorem) and the costs
associated with the decision.

→ It is basically a classification technique that involves the use of the Bayes Theorem which is used to
find the conditional probabilities.
→ The Bayes theorem describes the probability of an event based on the prior knowledge of the conditions
that might be related to the event. The conditional probability of A given B, represented by P(A | B) is
the chance of occurrence of A given that B has occurred.
P(A | B) = P(A, B) / P(B)
By using the chain rule, this can also be written as:
P(A, B) = P(A|B) P(B) = P(B|A) P(A), so that
P(A | B) = P(B|A) P(A) / P(B) —— (1)
Where,
P(B) = P(B, A) + P(B, A’) = P(B|A) P(A) + P(B|A’) P(A’)
Here, equation (1) is known as the Bayes Theorem of probability
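A worked numeric instance of equation (1) for the "disease given a positive test" example mentioned earlier. The 1% prevalence, 95% sensitivity, and 10% false-positive rate are assumed numbers chosen only to illustrate the calculation.

```python
# Bayes' theorem: P(A|B) = P(B|A) P(A) / P(B), with P(B) = P(B|A)P(A) + P(B|A')P(A').
p_disease = 0.01              # P(A): assumed prior probability of having the disease
p_pos_given_disease = 0.95    # P(B|A): assumed probability of a positive test if diseased
p_pos_given_healthy = 0.10    # P(B|A'): assumed probability of a positive test if healthy

p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)   # P(B)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos                     # P(A|B)
print(p_disease_given_pos)    # ≈ 0.088: under these assumptions, only ~9% despite the positive test
```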

Our aim is to explore each of the components included in this theorem. Let’s explore step by step:
a) Prior or State of Nature:
→ Prior probabilities represent how likely each class is to occur.
→ Priors are known before the training process.
→ The state of nature is treated as a random variable, with prior probability P(wi).
→ If there are only two classes, then the sum of the priors is P(w1) + P(w2)=1, if the classes are
exhaustive.
b) Class Conditional Probabilities:
→ It represents the probability that a feature x occurs given that the observation belongs to a particular class.
It is denoted by P(x|wi), where x is a particular feature.
→ In other words, it is the probability of the feature x occurring given that it belongs to class wi.
→ Sometimes, it is also known as the Likelihood.
→ It is the quantity that we have to estimate while training. During the training process, we have
input features X labeled with the corresponding class w, and we figure out the likelihood of that
set of features occurring given the class label.

Evidence:
→ It is the probability of occurrence of a particular feature i.e. P(X).
→ It can be calculated by summing over all the classes: P(X) = Σ P(X|wi) P(wi), where the sum runs over i = 1 … n.
→ Since the evidence depends on the class-conditional likelihoods, its value is also worked out during
training.
Posterior Probabilities:
→ It is the probability that an observation belongs to a particular class when certain features are given.
→ It is what we aim to compute in the test phase, in which we have the testing input/features (the given entity)
and have to find how likely it is, according to the trained model, that the features belong to the particular class wi.

VII. LINEAR ALGEBRA


What is Linear Algebra?
This is a branch of mathematics that concerns the study of vectors and certain rules to manipulate them.
When we are formalizing intuitive concepts, the common approach is to construct a set of objects (symbols) and a
set of rules to manipulate these objects. This is what we know as algebra.
If we talk about Linear Algebra in machine learning, it is defined as the part of mathematics that uses vector
spaces and matrices to represent linear equations.
When talking about vectors, people might flash back to their high school studies of the vector with
direction, just like the image below.

Fig: A geometric vector
This is a vector, but not the kind of vector discussed in the Linear Algebra for Machine Learning.
Instead, it would be this image below we would talk about.

Fig: A vector represented as a 4×1 matrix (a column vector)

What we had above is also a Vector, but another kind of vector. You might be familiar with matrix form (the
image below). The vector is a matrix with only 1 column, which is known as a column vector. In other words,
we can think of a matrix as a group of column vectors or row vectors. In summary, vectors are special objects
that can be added together and multiplied by scalars to produce another object of the same kind. We could have
various objects called vectors.

Fig: A matrix
→ Linear algebra itself is a systematic representation of data that computers can understand, and all the
operations in linear algebra follow systematic rules. That is why, in modern machine learning, linear
algebra is an important subject.
→ An example of how linear algebra is used is in the linear equation. Linear algebra is a tool used for the
linear equation because so many problems can be presented systematically in a linear way. The
typical linear equation is presented in the form below.

Fig: A typical linear equation
To solve the linear equation problem above, we use Linear Algebra to present the linear equation in a
systematic representation. This way, we can use the matrix characterization to look for the optimal
solution.

Fig: The linear equation in matrix representation


To summarize the Linear Algebra subject, there are three terms you might want to learn more about as a
starting point within this subject (a small NumPy sketch follows the list):
• Vector
• Matrix
• Linear Equation
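The snippet below is a small NumPy sketch of the three terms: it builds a matrix and a vector and solves the linear system Ax = b; the particular numbers are arbitrary.

```python
# A matrix, a vector, and the solution of the linear equation Ax = b.
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])    # matrix of coefficients
b = np.array([5.0, 10.0])     # vector of constants

x = np.linalg.solve(A, b)     # solves the system of linear equations
print(x)                      # [1. 3.]  ->  x1 = 1, x2 = 3
```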

Analytic Geometry (Coordinate Geometry)


Analytic geometry is a study in which we learn the data (point) positions using ordered pairs of
coordinates. This study is concerned with defining and representing geometrical shapes numerically and
extracting numerical information from the shapes' numerical definitions and representations. In simpler terms, we
project the data onto a plane and receive numerical information from there.

Fig: Data points in a Cartesian coordinate plane

Above is an example of how we acquired information from the data point by projecting the dataset into
the plane. How we acquire the information from this representation is the heart of Analytical Geometry. To
help you start learning this subject, here are some important terms you might need.
Distance Function
A distance function is a function that provides numerical information for the distance between the elements
of a set. If the distance is zero, then elements are equivalent. Else, they are different from each other.
An example of the distance function is Euclidean Distance which calculates the linear distance
between two data points.

Fig: The Euclidean distance equation, d(p, q) = √( Σ (pi − qi)² )
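A direct computation of the Euclidean distance between two example points (the coordinates are arbitrary):

```python
# Euclidean distance between two data points, straight from the formula.
import numpy as np

p = np.array([1.0, 2.0])
q = np.array([4.0, 6.0])
distance = np.sqrt(np.sum((p - q) ** 2))   # equivalent to np.linalg.norm(p - q)
print(distance)                            # 5.0
```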


Inner Product
The inner product is a concept that introduces intuitive geometrical concepts, such as the length of a
vector and the angle or distance between two vectors. It is often denoted as
⟨x,y⟩ (or occasionally (x,y) or ⟨x|y⟩).
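A small sketch of the inner product and the length and angle information it carries, using two arbitrary vectors:

```python
# Inner (dot) product <x, y>, the length of a vector, and the angle between two vectors.
import numpy as np

x = np.array([1.0, 0.0])
y = np.array([1.0, 1.0])

inner = np.dot(x, y)                                        # <x, y>
length_x = np.sqrt(np.dot(x, x))                            # length of x
cos_angle = inner / (np.linalg.norm(x) * np.linalg.norm(y))
print(inner, length_x, np.degrees(np.arccos(cos_angle)))    # 1.0, 1.0, ~45 degrees
```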

VIII. HYPOTHESIS SPACE.


Hypothesis Space
→ The hypothesis space (H) in machine learning refers to the set of all possible functions or models that
a learning algorithm can potentially consider to explain the relationship between input features and
target outputs.

→ The concept of a hypothesis is fundamental in Machine Learning and data science endeavours. In the
realm of machine learning, a hypothesis serves as an initial assumption made by data scientists and
ML professionals when attempting to address a problem. Machine learning involves conducting
experiments based on past experiences, and these hypotheses are crucial in formulating potential
solutions.
Hypothesis in Machine Learning
A hypothesis in machine learning is the model's presumption regarding the connection between
the input features and the result. It is an illustration of the mapping function that the algorithm is attempting
to discover using the training set. To minimize the discrepancy between the expected and actual outputs, the
learning process involves modifying the weights that parameterize the hypothesis. The objective is to optimize
the model's parameters to achieve the best predictive performance on new, unseen data, and a cost function is
used to assess the hypothesis' accuracy.
How does a Hypothesis work?
In most supervised machine learning algorithms, our main goal is to find a possible hypothesis from
the hypothesis space that could map out the inputs to the proper outputs. The following figure shows the
common method to find out the possible hypothesis from the Hypothesis space:
Hypothesis Space (H)
Hypothesis space is the set of all possible legal hypotheses. This is the set from which the machine
learning algorithm determines the single best hypothesis that best describes the target function
or the outputs.
Hypothesis (h)
A hypothesis is a function that best describes the target in supervised machine learning. The hypothesis
that an algorithm comes up with depends upon the data and also upon the restrictions and bias that
we have imposed on the data.

The hypothesis can be calculated as:


y = mx + b, where y = output (range), m = slope of the line, x = input (domain), b = intercept
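As a tiny illustration, the sketch below fits the hypothesis y = mx + b to a few made-up points by least squares:

```python
# Fitting the hypothesis y = m*x + b to a handful of invented points.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, 4.9, 7.2, 8.8])

m, b = np.polyfit(x, y, deg=1)    # slope and intercept of the best-fitting line
print(m, b)                       # roughly m ≈ 1.94, b ≈ 1.15, so h(x) ≈ 1.94x + 1.15
```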

To better understand the Hypothesis Space and Hypothesis, consider the following coordinate plane that
shows the distribution of some data:

Say suppose we have test data for which we have to determine the outputs or results. The test data is as shown
below:

We can predict the outcomes by dividing the coordinate as shown below:

So the test data would yield the following result:

But note here that we could have divided the coordinate plane as:

→ The way in which the coordinate would be divided depends on the data, algorithm and
constraints.
→ All these legal possible ways in which we can divide the coordinate plane to predict the
outcome of the test data composes of the Hypothesis Space.
→ Each individual possible way is known as the hypothesis.
→ Hence, in this example the hypothesis space would be like:

Hypothesis Space and Representation in Machine Learning


The hypothesis space comprises all possible legal hypotheses that a machine learning algorithm can
consider. Hypotheses are formulated based on various algorithms and techniques, including linear regression,
decision trees, and neural networks. These hypotheses capture the mapping function transforming input data
into predictions.
Hypothesis Formulation and Representation in Machine Learning
Hypotheses in machine learning are formulated based on various algorithms and techniques, each with
its representation. For example:
Linear Regression: h(X) = θ0 + θ1X1 + θ2X2 + ... + θnXn
Decision Trees: h(X) = Tree(X)
Neural Networks: h(X) = NN(X)
In the case of complex models like neural networks, the hypothesis may involve multiple layers of
interconnected nodes, each performing a specific computation.

Hypothesis Evaluation:
The process of machine learning involves not only formulating hypotheses but also evaluating their
performance. This evaluation is typically done using a loss function or an evaluation metric that quantifies the
disparity between predicted outputs and ground truth labels. Common evaluation metrics include mean
squared error (MSE), accuracy, precision, recall, F1-score, and others. By comparing the predictions of the
hypothesis with the actual outcomes on a validation or test dataset, one can assess the effectiveness of the
model.
Hypothesis Testing and Generalization:
Once a hypothesis is formulated and evaluated, the next step is to test its generalization capabilities.
Generalization refers to the ability of a model to make accurate predictions on unseen data. A hypothesis that
performs well on the training dataset but fails to generalize to new instances is said to suffer from overfitting.
Conversely, a hypothesis that generalizes well to unseen data is deemed robust and reliable.
The process of hypothesis formulation, evaluation, testing, and generalization is often iterative in nature. It
involves refining the hypothesis based on insights gained from model performance, feature importance, and
domain knowledge. Techniques such as hyperparameter tuning, feature engineering, and model selection play
a crucial role in this iterative refinement process.
Hypothesis in Statistics
In statistics, a hypothesis refers to a statement or assumption about a population parameter. It is a
proposition or educated guess that helps guide statistical analyses. There are two types of hypotheses: the null
hypothesis (H0) and the alternative hypothesis (H1 or Ha).
Null Hypothesis (H0): This hypothesis suggests that there is no significant difference or effect, and any
observed results are due to chance. It often represents the status quo or a baseline assumption.
Alternative Hypothesis (H1 or Ha): This hypothesis contradicts the null hypothesis, proposing that there is a
significant difference or effect in the population. It is what researchers aim to support with evidence.

IX. INDUCTIVE BIAS


What is Inductive Bias in Machine Learning?
→ In the realm of machine learning, the concept of inductive bias plays a pivotal role in shaping how
algorithms learn from data and make predictions.
→ It serves as a guiding principle that helps algorithms generalize from the training data to unseen data,
ultimately influencing their performance and decision-making processes. In this article, we delve into
the intricacies of inductive bias, its significance in machine learning, and its implications for model
development and interpretation.

What is Inductive Bias?


→ Inductive bias can be defined as the set of assumptions or biases that a learning algorithm employs to
make predictions on unseen data based on its training data. These assumptions are inherent in the
algorithm's design and serve as a foundation for learning and generalization.
→ The inductive bias of an algorithm influences how it selects a hypothesis (a possible explanation or
model) from the hypothesis space (the set of all possible hypotheses) that best fits the training data. It
helps the algorithm navigate the trade-off between fitting the training data too closely (overfitting) and
being too simple to capture its structure (underfitting), so that it can generalize well to unseen data.
Types of Inductive Bias
Inductive bias can manifest in various forms, depending on the algorithm and its underlying
assumptions. Some common types of inductive bias include:
Bias towards simpler explanations: Many machine learning algorithms, such as decision trees and
linear models, have a bias towards simpler hypotheses. They prefer explanations that are more
parsimonious and less complex, as these are often more likely to generalize well to unseen data.
Bias towards smoother functions: Algorithms like kernel methods or Gaussian processes have a
bias towards smoother functions. They assume that neighbouring points in the input space should have
similar outputs, leading to smooth decision boundaries.
Bias towards specific types of functions: Neural networks, for example, have a bias towards
learning complex, nonlinear functions. This bias allows them to capture intricate patterns in the data
but can also lead to overfitting if not regularized properly.
Bias towards sparsity: Some algorithms, like Lasso regression, have a bias towards sparsity. They
prefer solutions where only a few features are relevant, which can improve interpretability and
generalization.
Importance of Inductive Bias
→ Inductive bias is crucial in machine learning as it helps algorithms generalize from limited training
data to unseen data. Without a well-defined inductive bias, algorithms may struggle to make accurate
predictions or may overfit the training data, leading to poor performance on new data.
→ Understanding the inductive bias of an algorithm is essential for model selection, as different biases
may be more suitable for different types of data or tasks. It also provides insights into how the
algorithm is learning and what assumptions it is making about the data, which can aid in interpreting
its predictions and results.
Challenges and Considerations
→ While inductive bias is essential for learning, it can also introduce limitations and challenges. Biases
that are too strong or inappropriate for the data can lead to poor generalization or biased predictions.
Balancing bias with variance (the variability of predictions) is a key challenge in machine learning,
requiring careful tuning and model selection.

→ Additionally, the choice of inductive bias can impact the interpretability of the model. Simpler biases
may lead to more interpretable models, while more complex biases may sacrifice interpretability for
improved performance.
Inductive bias is a fundamental concept in machine learning that shapes how algorithms learn and
generalize from data. It serves as a guiding principle that influences the selection of hypotheses and the
generalization of models to unseen data. Understanding the inductive bias of an algorithm is essential for
model development, selection, and interpretation, as it provides insights into how the algorithm is learning
and making predictions. By carefully considering and balancing inductive bias, machine learning practitioners
can develop models that generalize well and provide valuable insights into complex datasets.

X. EVALUATION
→ Evaluation in machine learning is the process of assessing the performance of a trained model or
hypothesis. This is crucial for understanding how well the model generalizes to new, unseen data and
for comparing different models.
→ Evaluation typically involves:
Splitting Data: Dividing the available dataset into training, validation, and test sets. The model
is trained on the training set, hyper-parameters are tuned using the validation set, and the final
performance is measured on the unseen test set.
Metrics: Using appropriate metrics to quantify performance.
Example: For classification, metrics like accuracy, precision, recall, F1-score, or AUC-ROC are
used. For regression, metrics like Mean Squared Error (MSE), Root Mean Squared Error (RMSE), or R-
squared are common.
Cross-Validation: Employing techniques like k-fold cross-validation to get a more robust estimate of the model's
performance, especially when data is limited.
Example: After training a classification model, we can evaluate its performance by calculating its
accuracy on a held-out test set. If the accuracy is 85%, it indicates that the model correctly classifies 85% of
the examples in the test set.
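A minimal sketch of this evaluation workflow: split the data, train, and measure accuracy on the held-out test set. The breast-cancer dataset and the decision tree are illustrative choices, not part of the original notes.

```python
# Split -> train -> evaluate on the unseen test set.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
test_accuracy = accuracy_score(y_test, model.predict(X_test))
print(test_accuracy)   # fraction of test examples classified correctly
```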

XI. TRAINING AND TEST SETS, CROSS VALIDATION, CONCEPT OF OVER FITTING, UNDER
FITTING, BIAS AND VARIANCE.
1. Training and Test Sets
Training Set:
This is the portion of the dataset used to train the machine learning model. The model learns patterns
and relationships from this data.

Test Set:
This is the unseen portion of the dataset used to evaluate the model's performance on new, unobserved
data. It assesses how well the model generalizes.
Example: Imagine training a model to predict house prices. The training set would contain features (size,
location, number of rooms) and corresponding prices for houses the model learns from. The test set would
contain similar features for houses the model hasn't seen, and its predictions would be compared to the actual
prices to gauge accuracy.
2. Cross-Validation
Cross-validation is a technique to estimate a model's performance and robustness more reliably than a
single train-test split. It involves partitioning the data into multiple subsets (folds).
K-Fold Cross-Validation: The dataset is divided into 'k' equal-sized folds. The model is trained 'k' times,
each time using 'k-1' folds for training and one fold for testing. The results are then averaged.
Example: For a 5-fold cross-validation, the data is split into 5 parts. In the first iteration, folds 2-5 are
used for training, and fold 1 for testing. In the second, folds 1, 3-5 train, and fold 2 tests, and so on.
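The same 5-fold procedure can be run with scikit-learn's cross_val_score helper; the iris dataset and logistic regression estimator below are illustrative assumptions.

```python
# 5-fold cross-validation: train on 4 folds, test on the remaining fold, repeat, then average.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores)          # one accuracy score per fold
print(scores.mean())   # averaged estimate of model performance
```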
3. Overfitting and Underfitting
Overfitting occurs when a model learns the training data too well, including its noise and outliers,
leading to poor performance on new, unseen data.
Example: A model trained to classify images of cats and dogs overfits if it perfectly identifies the cats
and dogs in the training set but fails to recognize new cat and dog images because it memorized specific
features of the training images instead of learning general characteristics.
Underfitting occurs when a model is too simple to capture the underlying patterns in the training data,
resulting in poor performance on both training and test data.
Example: Using a simple linear regression model to predict a non-linear relationship (e.g., a curved
trend) between two variables would likely underfit, as the model cannot capture the complexity of the data.
→ Overfitting and Underfitting are the two main problems that occur in machine learning and degrade the
performance of the machine learning models.
→ The main goal of each machine learning model is to generalize well. Here, generalization
defines the ability of an ML model to provide a suitable output by adapting to a given set of unknown
inputs. It means that, after being trained on the dataset, it can produce reliable and accurate output.
→ Hence, underfitting and overfitting are the two conditions that need to be checked to judge the performance
of the model and whether the model is generalizing well or not.
Before understanding overfitting and underfitting, let's understand some basic terms that will help to
understand this topic well:
Signal: It refers to the true underlying pattern of the data that helps the machine learning model to
learn from the data.
Noise: Noise is unnecessary and irrelevant data that reduces the performance of the model.

Bias: Bias is a prediction error that is introduced in the model due to oversimplifying the machine
learning algorithms. Or it is the difference between the predicted values and the actual values.
Variance: If the machine learning model performs well with the training dataset, but does not perform
well with the test dataset, then variance occurs.
Overfitting
→ Overfitting occurs when our machine learning model tries to cover all the data points or more than the
required data points present in the given dataset. Because of this, the model starts caching noise and
inaccurate values present in the dataset, and all these factors reduce the efficiency and accuracy of the
model. The overfitted model has low bias and high variance.
→ The chances of overfitting increase the more we train our model: the longer we train,
the more likely the model is to become overfitted.
→ Overfitting is the main problem that occurs in supervised learning.
Example: The concept of the overfitting can be understood by the below graph of the linear regression output:

As we can see from the above graph, the model tries to cover all the data points present in the scatter
plot. It may look efficient, but in reality it is not, because the goal of the regression model is to find the best-fit
line; here we have not obtained a true best fit, so the model will generate prediction errors on new data.
How to avoid the Overfitting in Model
Both overfitting and underfitting cause degraded performance of the machine learning model. But
the more common problem is overfitting, so there are some ways by which we can reduce its occurrence in
our model (a regularization sketch follows this list):
→ Cross−Validation
→ Training with more data
→ Removing features
→ Early stopping the training
→ Regularization
→ Ensembling
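As a sketch of one remedy from the list (regularization), the snippet below compares plain linear regression with ridge regression on noisy data where only one feature matters; the data generation and the alpha value are illustrative assumptions.

```python
# Regularization sketch: ridge regression penalizes large coefficients to curb overfitting.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.RandomState(0)
X = rng.rand(30, 10)
y = X[:, 0] * 3.0 + rng.normal(scale=0.5, size=30)    # only the first feature truly matters

plain = LinearRegression().fit(X, y)                  # unregularized fit can chase the noise
regularized = Ridge(alpha=1.0).fit(X, y)              # penalty term shrinks the coefficients

print(np.round(plain.coef_, 2))        # coefficient values without regularization
print(np.round(regularized.coef_, 2))  # typically smaller magnitudes under the ridge penalty
```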

Underfitting
→ Underfitting occurs when our machine learning model is not able to capture the underlying trend of
the data.
→ To avoid overfitting in the model, the feeding of training data can be stopped at an early stage, due to
which the model may not learn enough from the training data. As a result, it may fail to find the best
fit of the dominant trend in the data.
→ In the case of underfitting, the model is not able to learn enough from the training data, and hence it
reduces the accuracy and produces unreliable predictions. An underfitted model has high bias and low
variance.
Example: We can understand the underfitting using below output of the linear regression model:

As we can see from the above diagram, the model is unable to capture the data points present in the plot.
How to avoid underfitting:
→ By increasing the training time of the model.
→ By increasing the number of features.
Goodness of Fit
→ The "Goodness of fit" term is taken from the statistics, and the goal of the machine learning models to
achieve the goodness of fit. In statistics modeling, it defines how closely the result or predicted values
match the true values of the dataset.
→ The model with a good fit is between the underfitted and overfitted model, and ideally, it makes
predictions with 0 errors, but in practice, it is difficult to achieve it.
→ As when we train our model for a time, the errors in the training data go down, and the same happens
with test data. But if we train the model for a long duration, then the performance of the model may
decrease due to the overfitting, as the model also learn the noise present in the dataset.
→ The errors in the test dataset start increasing, so the point, just before the raising of errors, is the good
point, and we can stop here for achieving a good model. There are two other methods by which we can
get a good point for our model, which are the resampling method to estimate model accuracy and
validation dataset.

Cross validation:
→ Cross−validation is a technique for validating the model efficiency by training it on the subset of input
data and testing on previously unseen subset of the input data. We can also say that it is a technique
to check how a statistical model generalizes to an independent dataset.
→ In machine learning, there is always a need to test the stability of the model; we cannot judge how well a model generalizes from its fit on the training dataset alone. For this purpose, we reserve a particular sample of the dataset that was not part of the training data. After that, we test our model on that sample before deployment, and this complete process comes under cross−validation. This is somewhat different from the general train−test split.
→ Hence the basic steps of cross−validations are:
o Reserve a subset of the dataset as a validation set.
o Provide the training to the model using the training dataset.
o Now, evaluate the model's performance using the validation set. If the model performs well on the validation set, proceed with it; otherwise, check for issues.
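These steps can be sketched in Python with scikit-learn; the dataset (iris) and the logistic regression model below are illustrative assumptions, not part of the notes:
```python
# A minimal sketch of the reserve-train-evaluate steps described above.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Step 1: reserve a subset of the dataset as a validation set.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 2: train the model using the training dataset.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Step 3: evaluate model performance using the validation set.
print("Validation accuracy:", model.score(X_val, y_val))
```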
Methods used for Cross-Validation
There are some common methods that are used for cross−validation. These methods are given below:
→ Validation Set Approach
→ Leave−P−out cross−validation
→ Leave one out cross−validation
→ K−fold cross−validation
→ Stratified k−fold cross−validation
Validation Set Approach
→ We divide our input dataset into a training set and test or validation set in the validation set approach.
Both the subsets are given 50% of the dataset.
→ But it has a big disadvantage: we use only 50% of the dataset to train our model, so the model may fail to capture important information in the data. This approach also tends to give an underfitted model.
Leave-P-out cross-validation
→ In this approach, p data points are left out of the training data. If there are n data points in total in the original input dataset, then n−p data points are used as the training set and the p data points as the validation set. This complete process is repeated for all possible combinations, and the average error is calculated to estimate the effectiveness of the model.
→ A disadvantage of this technique is that it can be computationally expensive for large p.


Leave one out cross-validation


→ This method is similar to leave−p−out cross−validation, but instead of p data points, only one data point is left out of training at a time. It means that, in this approach, for each learning set only one data point is reserved, and the remaining dataset is used to train the model. This process repeats for each data point. Hence for n samples, we get n different training sets and n test sets.
→ It has the following features:
• In this approach, the bias is minimum as all the data points are used.
• The process is executed for n times; hence execution time is high.
• This approach leads to high variation in testing the effectiveness of the model as we iteratively
check against one data point.
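A brief illustrative sketch of leave-one-out cross-validation using scikit-learn's LeaveOneOut splitter; the dataset and model are assumed here only for demonstration:
```python
# Leave-one-out CV: one fit per data point, averaged at the end.
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
loo = LeaveOneOut()                         # produces n splits for n samples
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=loo)
print("Number of fits:", len(scores))       # equals the number of data points
print("Mean accuracy:", scores.mean())
```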
K-Fold Cross-Validation
K−fold cross−validation approach divides the input dataset into K groups of samples of equal sizes.
These samples are called folds. For each learning set, the prediction function uses k−1 folds, and the rest of the
folds are used for the test set. This approach is a very popular CV approach because it is easy to understand, and
the output is less biased than other methods.
The steps for k−fold cross−validation are:
• Split the input dataset into K groups.
• For each group:
o Take one group as the reserve or test data set.
o Use the remaining groups as the training dataset.
o Fit the model on the training set and evaluate the performance of the model using the test set.
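The same steps can be sketched with scikit-learn's KFold; the diabetes dataset and the linear regression model are illustrative assumptions:
```python
# k-fold CV loop: each fold serves once as the test set.
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=0)   # split into K groups (folds)

scores = []
for train_idx, test_idx in kf.split(X):
    # one group is the test set, the remaining groups form the training set
    model = LinearRegression()
    model.fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))  # R^2 on the held-out fold

print("Per-fold R^2:", np.round(scores, 3))
print("Mean R^2:", np.mean(scores))
```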
Stratified k-fold cross-validation
→ This technique is similar to k−fold cross−validation, with some small changes. It works on the concept of stratification, which is the process of rearranging the data to ensure that each fold or group is a good representative of the complete dataset (for example, preserving class proportions). It is one of the best approaches for dealing with bias and variance.
→ It can be understood with an example of housing prices, where the price of some houses can be much higher than that of other houses. To tackle such situations, a stratified k−fold cross−validation technique is useful.
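A short sketch of stratified k-fold on an assumed classification dataset, where stratification keeps the class proportions roughly equal in every fold:
```python
# Stratified k-fold: each fold mirrors the overall class distribution.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(clf, X, y, cv=skf)
print("Per-fold accuracy:", scores.round(3))
```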
Holdout Method
→ This method is the simplest cross−validation technique of all. In this method, we set aside a subset of the data and use the model, trained on the rest of the dataset, to get prediction results on that held-out subset.
→ The error that occurs in this process tells how well our model will perform on unknown data. Although this approach is simple to perform, it still faces the issue of high variance, and it also sometimes produces misleading results.


Comparison of Cross−validation to train/test split in Machine Learning


Train/test split: The input data is divided into two parts, a training set and a test set, in a ratio such as 70:30 or 80:20. The estimate it gives can have high variance, which is one of its biggest disadvantages.
Training Data: The training data is used to train the model, and the dependent variable is known.
Test Data: The test data is used to make predictions from the model that has already been trained on the training data. It has the same features as the training data but is not part of it.
Cross-Validation dataset: It is used to overcome the disadvantage of the train/test split by splitting the dataset into several groups of train/test splits and averaging the results. It can be used when we want to optimize a model, trained on the training dataset, for the best performance. It is more efficient than a single train/test split because every observation is used for both training and testing.
Limitations of Cross-Validation
There are some limitations of the cross−validation technique, which are given below:
→ Under ideal conditions it provides an optimal output, but with inconsistent data it may produce drastically misleading results. This is one of the big disadvantages of cross−validation, because there is no certainty about the kind of data encountered in machine learning.
→ In predictive modelling, the data evolves over time, which may create differences between the training set and the validation sets. For example, if we create a model for the prediction of stock market values and train it on the previous 5 years of stock values, the realistic values for the next 5 years may be drastically different, so it is difficult to expect correct output in such situations.
5. Bias and Variance
Bias:
The error introduced by approximating a real-world problem, which may be complex, by a simplified
model. High bias leads to underfitting.
Example: Assuming a linear relationship when the data is clearly non-linear.
Variance:
The error introduced by the model's sensitivity to small fluctuations in the training data. High variance
leads to overfitting.
Example: A very complex model that perfectly fits the training data, including its noise, will likely have high
variance and perform poorly on new data.
Bias-Variance Trade-off: There is an inherent trade-off between bias and variance. Reducing bias often
increases variance, and vice versa. The goal is to find a model that balances these two errors for optimal
generalization performance.
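The trade-off can be illustrated with a small sketch on synthetic data (all values below are made up for demonstration): a degree-1 polynomial tends to underfit (high bias), while a very high-degree polynomial tends to overfit (high variance), which usually shows up as a low training error but a higher test error:
```python
# Bias-variance illustration: sweep the polynomial degree and compare errors.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 1, 60)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.2, size=60)   # noisy non-linear data
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for degree in (1, 4, 15):   # underfit, reasonable fit, likely overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    print("degree", degree,
          "train MSE:", round(mean_squared_error(y_tr, model.predict(X_tr)), 3),
          "test MSE:", round(mean_squared_error(y_te, model.predict(X_te)), 3))
```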


XII. REGRESSION:

Regression modeling: Predicting continuous values.

1. LINEAR REGRESSION:

→ Linear regression is one of the easiest and most popular Machine Learning algorithms. It is a statistical
method that is used for predictive analysis. Linear regression makes predictions for continuous/real or
numeric variables such as sales, salary, age, product price, etc.
→ Linear regression algorithm shows a linear relationship between a dependent (y) variable and one or more independent (x) variables, hence it is called linear regression. Since linear regression shows a linear relationship, it finds how the value of the dependent variable changes according to the value of the independent variable.
The linear regression model provides a sloped straight line representing the relationship between the variables.

Mathematically, we can represent a linear regression as:


y = a0 + a1x + ε
Here,
Y= Dependent Variable (Target Variable)
X= Independent Variable (predictor Variable)
a0= intercept of the line (Gives an additional degree of freedom)
a1 = Linear regression coefficient (scale factor to each input value).
ε = random error
The values for x and y variables are training datasets for Linear Regression model representation.
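As a minimal sketch, the same model can be fitted with scikit-learn; the x and y values below are made-up training data used only for illustration:
```python
# Fitting y = a0 + a1*x on a tiny made-up dataset.
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([[1], [2], [3], [4], [5]])      # independent variable (one column)
y = np.array([3.1, 4.9, 7.2, 9.0, 10.8])     # dependent variable

model = LinearRegression().fit(x, y)
print("a1 (coefficient):", model.coef_[0])
print("a0 (intercept):", model.intercept_)
print("prediction for x = 6:", model.predict([[6]])[0])
```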
Types of Linear Regression
Linear regression can be further divided into two types of the algorithm:
o Simple Linear Regression:
If a single independent variable is used to predict the value of a numerical dependent
variable, then such a Linear Regression algorithm is called Simple Linear Regression.


o Multiple Linear regression:


If more than one independent variable is used to predict the value of a numerical dependent
variable, then such a Linear Regression algorithm is called Multiple Linear Regression.
Linear Regression Line
A linear line showing the relationship between the dependent and independent variables is called
a regression line. A regression line can show two types of relationship:
Positive Linear Relationship:
If the dependent variable increases on the Y−axis and independent variable increases on
X−axis, then such a relationship is termed as a Positive linear relationship.

Negative Linear Relationship:


If the dependent variable decreases on the Y−axis and independent variable increases on the X−axis,
then such a relationship is called a negative linear relationship.

Finding the best fit line:


→ When working with linear regression, our main goal is to find the best fit line, which means the error between the predicted values and the actual values should be minimized. The best fit line will have the least error.
→ Different values for the weights or coefficients of the line (a0, a1) give different regression lines, so we need to calculate the best values for a0 and a1 to find the best fit line; to calculate this we use a cost function.


Cost function
→ The different values for weights or coefficient of lines (a0, a1) gives the different line of regression,
and the cost function is used to estimate the values of the coefficient for the best fit line.
→ Cost function optimizes the regression coefficients or weights. It measures how a linear regression
model is performing.
→ We can use the cost function to find the accuracy of the mapping function, which maps the input
variable to the output variable. This mapping function is also known as Hypothesis function.
→ For Linear Regression, we use the Mean Squared Error (MSE) cost function, which is the average of the squared errors between the predicted values and the actual values.
→ For the above linear equation, MSE can be calculated as:
MSE = (1/N) Σ (yi − (a1xi + a0))², with the sum taken over i = 1 to N
where,
N = Total number of observations
yi = Actual value
(a1xi + a0) = Predicted value
Residuals: The distance between an actual value and its predicted value is called the residual. If the observed points are far from the regression line, the residuals are large and so the cost function is high. If the scatter points are close to the regression line, the residuals are small and hence the cost function is low.
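A small sketch of how the residuals and the MSE cost are computed for a candidate line y = a1x + a0; the numbers are illustrative assumptions:
```python
# Computing residuals and MSE for one candidate (a0, a1) pair.
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([3.1, 4.9, 7.2, 9.0, 10.8])   # actual values
a0, a1 = 1.0, 2.0                          # candidate intercept and slope

y_pred = a1 * x + a0                       # predicted values
residuals = y - y_pred                     # distance of each point from the line
mse = np.mean(residuals ** 2)              # average of the squared errors
print("residuals:", residuals)
print("MSE:", mse)
```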
Gradient Descent:
→ Gradient descent is used to minimize the MSE by calculating the gradient of the cost function.
→ A regression model uses gradient descent to update the coefficients of the line by reducing the cost
function.
→ This is done by selecting initial values for the coefficients and then iteratively updating them to reach the minimum of the cost function.
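A bare-bones gradient descent sketch for the MSE cost above; the learning rate and number of iterations are arbitrary illustrative choices:
```python
# Gradient descent for simple linear regression on a tiny made-up dataset.
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([3.1, 4.9, 7.2, 9.0, 10.8])

a0, a1 = 0.0, 0.0          # initial coefficient values (zeros here)
lr = 0.02                  # learning rate
for _ in range(5000):
    y_pred = a1 * x + a0
    error = y_pred - y
    # gradients of the MSE with respect to a0 and a1
    grad_a0 = 2 * error.mean()
    grad_a1 = 2 * (error * x).mean()
    a0 -= lr * grad_a0     # step downhill on the cost surface
    a1 -= lr * grad_a1

print("a0 =", round(a0, 3), " a1 =", round(a1, 3))
```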
Model Performance:
The Goodness of fit determines how the line of regression fits the set of observations. The process of
finding the best model out of various models is called optimization. It can be achieved by below method:
1. R-squared method:
→ R−squared is a statistical method that determines the goodness of fit.
→ It measures the strength of the relationship between the dependent and independent variables on a
scale of 0−100%.


→ A high value of R−square indicates less difference between the predicted values and the actual values and hence represents a good model.
→ It is also called the coefficient of determination, or the coefficient of multiple determination for multiple regression.
→ It can be calculated from the below formula:
R-squared = Explained variation / Total variation
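A quick sketch of computing R-squared with scikit-learn's r2_score; the actual and predicted values are illustrative:
```python
# R-squared close to 1 indicates a good fit.
from sklearn.metrics import r2_score

y_actual = [3.1, 4.9, 7.2, 9.0, 10.8]
y_predicted = [3.0, 5.0, 7.0, 9.0, 11.0]
print("R-squared:", r2_score(y_actual, y_predicted))
```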

2. LEAST SQUARES METHOD


→ The least−squares regression method is a technique commonly used in Regression Analysis. It is a
mathematical method used to find the best fit line that represents the relationship between an
independent and dependent variable.
→ To understand the least−squares regression method, let's get familiar with the concepts involved in formulating the line of best fit.
What is the Line Of Best Fit?
→ Line of best fit is drawn to represent the relationship between 2 or more variables. To be more specific,
the best fit line is drawn across a scatter plot of data points in order to represent a relationship
between those data points.
→ Regression analysis makes use of mathematical methods such as least squares to obtain a definite
relationship between the predictor variable (s) and the target variable.
→ The least−squares method is one of the most effective ways used to draw the line of best fit. It is
based on the idea that the square of the errors obtained must be minimized to the most possible
extent and hence the name least squares method.
→ If we were to plot the best fit line that depicts the sales of a company over a period of time, it would look something like this:

Notice that the line is as close as possible to all the scattered data points. This is what an ideal best fit
line looks like. To better understand the whole process let’s see how to calculate the line using the Least
Squares Regression.


Steps to calculate the Line of Best Fit


To start constructing the line that best depicts the relationship between variables in the data, we first need
to get our basics right. Take a look at the equation below:
y = mx + c
Surely, you’ve come across this equation before. It is a simple equation that represents a straight line along 2-
Dimensional data, i.e. x−axis and y−axis. To better understand this, let’s break down the equation:
→ y: dependent variable
→ m: the slope of the line
→ x: independent variable
→ c: y−intercept
So, the aim is to calculate the values of slope, y−intercept and substitute the corresponding ‘x’ values
in the equation in order to derive the value of the dependent variable.
Let’s see how this can be done.
As an assumption, let’s consider that there are ‘n’ data points.
Step 1: Calculate the slope 'm' by using the following formula:
m = (n Σxy − Σx Σy) / (n Σx² − (Σx)²)
Step 2: Compute the y−intercept (the value of y at the point where the line crosses the y−axis):
c = (Σy − m Σx) / n
Step 3: Substitute the values in the final equation:
y = mx + c
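These three steps can be sketched directly with NumPy; the (x, y) pairs below are assumed sample data, chosen so that they reproduce the slope and intercept quoted in the worked example that follows (m ≈ 1.518, c ≈ 0.305), since the original table is not reproduced here:
```python
# Least-squares slope and intercept computed from the closed-form formulas.
import numpy as np

x = np.array([2, 3, 5, 7, 9], dtype=float)   # assumed T-shirt prices
y = np.array([4, 5, 7, 10, 15], dtype=float) # assumed units sold
n = len(x)

m = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x ** 2) - np.sum(x) ** 2)  # Step 1
c = (np.sum(y) - m * np.sum(x)) / n          # Step 2: same as mean(y) - m * mean(x)
print(f"best fit line: y = {m:.3f}x + {c:.3f}")  # Step 3
```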

Now let’s look at an example and see how you can use the least−squares regression method to compute the
line of best fit.
Least Squares Regression Example
Consider an example. Tom, who is the owner of a retail shop, recorded the price of different T−shirts vs. the number of T−shirts sold at his shop over a period of one week. He tabulated this as shown below:


Let us use the concept of least squares regression to find the line of best fit for the above data.
Step 1: Calculate the slope ‘m’ by using the following formula:

After you substitute the respective values, m = 1.518 approximately.


Step 2: Compute the y−intercept value
c = (Σy − m Σx) / n

After you substitute the respective values, c = 0.305 approximately.


Step 3: Substitute the values in the final equation
y = mx + c

Once you substitute the values, it should look something like this:
y = 1.518x + 0.305

Let’s construct a graph that represents the y=mx + c line of best fit:

Now Tom can use the above equation to estimate how many T−shirts priced at $8 he can sell at the retail shop.
y = 1.518 × 8 + 0.305 ≈ 12.45 T−shirts
This comes to about 13 T−shirts! That's how simple it is to make predictions using Linear Regression.


3. LASSO REGRESSION
Whenever we hear the term "regression," two things that come to mind are linear regression and logistic regression. Even though logistic regression falls under the classification category of algorithms, it still comes to mind.
These two topics are quite famous and are the basic introduction topics in Machine Learning. There
are other types of regression, like
→ Lasso regression,
→ Ridge regression,
→ Polynomial regression,
→ Stepwise regression,
→ ElasticNet regression
The above-mentioned techniques are mostly used in regression-type analytical problems. When we increase the degrees of freedom of regression models (by adding polynomial terms to the equation), they tend to overfit. Using regularization techniques, we can overcome the overfitting issue.
Two popular regularization methods are lasso regression and ridge regression; ridge regression follows the same idea but penalizes the squared values of the coefficients rather than their absolute values.
What Is Regression?
Regression is a statistical technique used to determine the relationship between one dependent variable
and one or many independent variables. In simple words, a regression analysis will tell you how your result
varies for different factors.
For example,
What determines a person's salary?
Many factors, like educational qualification, experience, skills, job role, company, etc., play a role in salary.
You can use regression analysis to predict the dependent variable – salary using the mentioned factors.
y = mx+c
Do you remember this equation from our school days?
It is nothing but a linear regression equation. In the above equation, the dependent variable is estimated from the independent variable.
In mathematical terms,
→ Y is the dependent value,
→ X is the independent value,
→ m is the slope of the line,
→ c is the constant value.
The same equation terms are named slightly differently in the machine learning or statistical world.


→ Y is the predicted value,


→ X is feature value,
→ m is coefficients or weights,
→ c is the bias value.
The fitted line in such a plot represents the linear regression model and shows how well the model fits the data. It may look like a good model, but sometimes the model fits the data too closely, resulting in overfitting.

To create the fitted line from the actual values, the regression model iterates and recalculates the m (coefficient) and c (bias) values while trying to reduce the loss value with a proper loss function.
Due to overfitting, the model will have low bias and high variance. The fit is good on the training data, but it will not give good predictions on the test data. Regularization comes into play to tackle this issue.
What Is Regularization?
Regularization solves the problem of overfitting. Overfitting causes low model accuracy. It happens
when the model learns the data as well as the noises in the training set.
Noise refers to random data points in the training set that do not represent the actual properties of the data.
Y ≈ C0 + C1X1 + C2X2 + …+ CpXp
Y represents the dependent variable, X represents the independent variables and C represents the
coefficient estimates for different variables in the above linear regression equation.
The model fitting involves a loss function known as the sum of squares. The coefficients in the equation
are chosen in a way to reduce the loss function to a minimum value. Wrong coefficients get selected if there
is a lot of irrelevant data in the training set.
Definition Of Lasso Regression
→ Lasso regression is like linear regression, but it uses a "shrinkage" technique in which the regression coefficients are shrunk towards zero. Linear regression gives you the regression coefficients as observed in the dataset.
→ Lasso regression allows you to shrink or regularize these coefficients to avoid overfitting and make them work better on different datasets.


→ This type of regression is used when the dataset shows high multicollinearity or when you want to
automate variable elimination and feature selection.
When To Use Lasso Regression?
Choosing a model depends on the dataset and the problem statement you are dealing with. It is essential
to understand the dataset and how features interact with each other.
Lasso regression penalizes less important features of your dataset and makes their respective
coefficients zero, thereby eliminating them. Thus it provides you with the benefit of feature selection and
simple model creation. So, if the dataset has high dimensionality and high correlation, lasso regression can be
used.
The Statistics of Lasso Regression



→ d1, d2, d3, etc. represent the distances between the actual data points and the fitted model line. The least-squares term is the sum of the squares of these distances from the plotted curve.
→ In linear regression, the best model is chosen in a way to minimize the least−squares. While performing
lasso regression, we add a penalizing factor to the least−squares.
→ That is, the model is chosen in a way to reduce the below loss function to a minimal value.
D = least-squares + lambda * summation (absolute values of the magnitude of the coefficients)
→ Lasso regression penalty consists of all the estimated parameters. Lambda can be any value between
zero to infinity. This value decides how aggressive regularization is performed. It is usually chosen
using cross−validation.
→ Lasso penalizes the sum of the absolute values of the coefficients. As the lambda value increases, the coefficients shrink and some eventually become zero. In this way, lasso regression eliminates insignificant variables from the model. The regularized model may have slightly higher bias than linear regression but less variance for future predictions.
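A short sketch of lasso regression with scikit-learn, where the alpha parameter plays the role of lambda; the dataset and alpha values are illustrative assumptions:
```python
# Lasso shrinks some coefficients exactly to zero (feature elimination).
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso, LassoCV
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

lasso = Lasso(alpha=1.0).fit(X_tr, y_tr)   # larger alpha -> stronger shrinkage
print("coefficients:", lasso.coef_)        # several entries become exactly zero
print("test R^2:", lasso.score(X_te, y_te))

# lambda (alpha) is usually chosen by cross-validation:
lasso_cv = LassoCV(cv=5).fit(X_tr, y_tr)
print("alpha chosen by CV:", lasso_cv.alpha_)
```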


4. DECISION TREES
Tree based methods – Decision Trees
→ Tree−based machine learning methods are among the most commonly used supervised learning methods.
They are constructed from two entities: branches and nodes.
→ Tree−based ML methods are built by recursively splitting a training sample, using different features
from a dataset at each node that splits the data most effectively.
→ The splitting is based on learning simple decision rules inferred from the training data. Generally,
tree−based ML methods are simple and intuitive; to predict a class label or value, we start from the top
of the tree or the root and, using branches, go to the nodes by comparing features on the basis of which
will provide the best split.
→ Tree−based methods also use the mean for continuous variables or mode for categorical variables when
making predictions on training observations in the regions they belong to. Since the set of rules used to
segment the predictor space can be summarized in a visual representation with branches that show all
the possible outcomes, these approaches are commonly referred to as decision tree methods.
→ The methods are flexible and can be applied to either classification or regression problems.
Classification and Regression Trees (CART) is a commonly used term by Leo Breiman, referring to
the flexibility of the methods in solving both linear and non−linear predictive modeling problems.
Types of Decision Trees
Decision trees can be classified based on the type of target or response variable.
i. Classification Trees
The default type of decision trees, used when the response variable is categorical—i.e. predicting
whether a team will win or lose a game.
ii. Regression Trees
Used when the target variable is continuous or numerical in nature—i.e. predicting house prices
based on year of construction, number of rooms, etc.
Advantages of Tree-based Machine Learning Methods
1. Interpretability: Decision tree methods are easy to understand even for non−technical people.
2. The data type isn’t a constraint, as the methods can handle both categorical and numerical variables.
3. Data exploration — Decision trees help us easily identify the most significant variables and their
correlation.
Disadvantages of Tree-based Machine Learning Methods
1. Large decision trees are complex, time−consuming and less accurate in predicting outcomes.
2. Decision trees don’t fit well for continuous variables, as they lose important information when
segmenting the data into different regions.


Decision Tree Terminology
i) Root node — this represents the entire population or sample, which gets divided into two or more homogeneous subsets.
ii) Splitting — subdividing a node into two or more sub−nodes.
iii) Decision node — this is when a sub−node is divided into further sub−nodes.
iv) Leaf/Terminal node — this is the final/last node that we consider for our model output. It cannot be split
further.
v) Pruning — removing unnecessary sub−nodes of a decision node to combat overfitting.
vi) Branch/Sub-tree — the sub−section of the entire tree.
vii) Parent and Child node — a node that’s subdivided into a sub−node is a parent, while the sub−node is
the child node.

CART (Classification And Regression Tree)


CART (Classification And Regression Tree) is a variation of the decision tree algorithm. It can
handle both classification and regression tasks. Scikit−Learn uses the Classification And Regression Tree
(CART) algorithm to train Decision Trees (also called “growing” trees). CART was first produced by Leo
Breiman, Jerome Friedman, Richard Olshen, and Charles Stone in 1984.
CART Algorithm
→ CART is a predictive algorithm used in Machine learning and it explains how the target variable’s
values can be predicted based on other matters.
→ It is a decision tree in which each fork is a split on a predictor variable and each leaf node holds a prediction for the target variable.
→ In the decision tree, nodes are split into sub−nodes on the basis of a threshold value of an attribute.
→ The root node is taken as the training set and is split into two by considering the best attribute and
threshold value.


→ Further, the subsets are split using the same logic. This continues until pure sub−sets are found or the maximum possible number of leaves in the growing tree is reached.
The CART algorithm works via the following process:
→ The best split point of each input is obtained.
→ Based on the best split points of each input in Step 1, the new “best” split point is identified.
→ Split the chosen input according to the “best” split point.
→ Continue splitting until a stopping rule is satisfied or no further desirable splitting is available.

The CART algorithm uses Gini impurity to split the dataset into a decision tree. It does that by searching for the best homogeneity of the sub−nodes, with the help of the Gini index criterion.
Gini index/Gini impurity
→ The Gini index is a metric for the classification tasks in CART.
→ It is computed from the sum of the squared probabilities of each class.
→ It computes the degree of probability of a specific variable that is wrongly being classified when chosen
randomly and a variation of the Gini coefficient.
→ It works on categorical variables, provides outcomes either “successful” or “failure” and hence
conducts binary splitting only.
The degree of the Gini index varies from 0 to 1:
→ A value of 0 means that all the elements belong to a single class, or only one class exists there.
→ A value of 1 signifies that the elements are randomly distributed across various classes.
→ A value of 0.5 denotes that the elements are uniformly distributed over some classes.
Mathematically, we can write Gini impurity as follows:
Gini = 1 − Σ (pi)²
where pi is the probability of an object being classified to a particular class.
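A tiny sketch of this formula as a Python function; the class-label lists used to test it are made-up examples:
```python
# Gini impurity: 1 - sum of squared class probabilities.
from collections import Counter

def gini_impurity(labels):
    """Gini = 1 - sum(p_i^2) over the classes present in `labels`."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

print(gini_impurity(["yes", "yes", "yes", "yes"]))   # 0.0 -> pure node
print(gini_impurity(["yes", "yes", "no", "no"]))     # 0.5 -> evenly mixed two classes
```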


Classification tree
A classification tree is an algorithm where the target variable is categorical. The algorithm is then used
to identify the “Class” within which the target variable is most likely to fall. Classification trees are used when
the dataset needs to be split into classes that belong to the response variable (like yes or no).
Regression tree
A Regression tree is an algorithm where the target variable is continuous and the tree is used to predict
its value. Regression trees are used when the response variable is continuous. For example, if the response
variable is the temperature of the day.
CART model representation
CART models are formed by picking input variables and evaluating split points on those variables until an
appropriate tree is produced.
Steps to create a Decision Tree using the CART algorithm:
→ Greedy algorithm: The input space is divided using a greedy method known as recursive binary splitting. This is a numerical procedure in which the values are lined up and different split points are tried and assessed using a cost function.
→ Stopping Criterion: As it works its way down the tree with the training data, the recursive binary
splitting method described above must know when to stop splitting. The most frequent halting method
is to utilize a minimum amount of training data allocated to every leaf node. If the count is smaller than
the specified threshold, the split is rejected and also the node is considered the last leaf node.
→ Tree pruning: A decision tree's complexity is defined as the number of splits in the tree. Trees with fewer branches are recommended, as they are simpler to grasp and less prone to overfitting the data. The quickest and simplest pruning approach is to work through each leaf node in the tree and evaluate the effect of removing it using a hold−out test set.
→ Data preparation for the CART: No special data preparation is required for the CART algorithm.
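A brief sketch of growing a CART-style tree with scikit-learn's DecisionTreeClassifier (scikit-learn uses an optimised CART implementation); the dataset and the parameter values for the stopping rule and pruning strength are illustrative assumptions:
```python
# Growing and pruning a decision tree with a Gini splitting criterion.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

tree = DecisionTreeClassifier(
    criterion="gini",        # Gini index as the splitting criterion
    min_samples_leaf=5,      # stopping criterion: minimum samples per leaf
    ccp_alpha=0.01,          # cost-complexity pruning strength
    random_state=0,
).fit(X_tr, y_tr)

print("test accuracy:", tree.score(X_te, y_te))
print(export_text(tree))     # text view of the learned splits
```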
Advantages of CART
→ Results are simplistic.
→ Classification and regression trees are Nonparametric and Nonlinear.
→ Classification and regression trees implicitly perform feature selection.
→ Outliers have no meaningful effect on CART.
→ It requires minimal supervision and produces easy−to−understand models.
Limitations of CART
→ Overfitting.
→ High variance.
→ Low bias.
→ The tree structure may be unstable.


Applications of the CART algorithm


→ For quick Data insights.
→ In Blood Donors Classification.
→ For environmental and ecological data.
→ In the financial sectors.

XIII. LOGISTIC REGRESSION

