Ocs351 Unit III
→ AI (Artificial Intelligence) is a machine's ability to perform cognitive functions as humans do, such
as perceiving, learning, reasoning, and solving problems. The benchmark for AI is human-level
performance in terms of reasoning, speech, and vision.
→ Machine learning is important because it gives enterprises a view of trends in customer behavior
and business operational patterns, as well as supports the development of new products. Many of
today's leading companies, such as Facebook, Google and Uber, make machine learning a central
part of their operations.
The life cycle of a Machine Learning program is straightforward and can be summarized in the following points:
1. Define a question
2. Collect data
3. Visualize data
4. Train algorithm
5. Test the Algorithm
6. Collect feedback
II. CLASSIFICATION
Classification
→ As the name suggests, Classification is the task of "classifying things" into sub-categories, but by a
machine! If that doesn't sound like much, imagine your computer being able to differentiate between
you and a stranger, between a potato and a tomato, or between an A grade and an F. Now it sounds
interesting, doesn't it?
→ In Machine Learning and Statistics, Classification is the problem of identifying which of a set of
categories (subpopulations) a new observation belongs to, on the basis of a training set of data containing
observations whose category membership is known.
Types of Classification
Classification is of two types:
Binary Classification: When we have to categorize the given data into 2 distinct classes. Example – on
the basis of the given health conditions of a person, we have to determine whether the person has a certain
disease or not.
Multiclass Classification: The number of classes is more than 2. Example – on the basis of data
about different species of flowers, we have to determine which species our observation belongs to.
Fig: Binary and Multiclass Classification. Here x1 and x2 are the variables upon which the class is predicted.
3. AUC-ROC curve:
o ROC curve stands for Receiver Operating Characteristics Curve and AUC stands for
Area Under the Curve.
o It is a graph that shows the performance of the classification model at different thresholds.
o To visualize and compare the performance of classification models across thresholds (including
multi-class models, via one-vs-rest curves), we use the AUC-ROC curve, as sketched below.
o The ROC curve is plotted with TPR and FPR, where TPR (True Positive Rate) on Y−axis and
FPR (False Positive Rate) on X−axis.
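As an illustration, the sketch below computes the ROC curve and AUC for a binary classifier. It assumes scikit-learn is installed; the logistic regression model and the synthetic dataset are placeholders chosen only for demonstration.

# Minimal AUC-ROC sketch (assumes scikit-learn; data and model are placeholders).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, roc_auc_score

X, y = make_classification(n_samples=500, random_state=0)           # synthetic binary data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]                          # probability of the positive class

fpr, tpr, thresholds = roc_curve(y_test, scores)                    # FPR (X-axis) and TPR (Y-axis) per threshold
print("AUC =", roc_auc_score(y_test, scores))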
Use cases of Classification Algorithms
Classification algorithms can be used in different places. Below are some popular use cases of
Classification Algorithms:
→ Email Spam Detection
→ Speech Recognition
→ Identification of cancer tumor cells.
→ Drugs Classification
→ Biometric Identification, etc.
III. REGRESSION
→ Regression in machine learning refers to a supervised learning technique where the goal is to
predict a continuous numerical value based on one or more independent features. It finds relationships
between variables so that predictions can be made. We have two types of variables present in
regression:
Dependent Variable (Target): The variable we are trying to predict, e.g., house price.
Independent Variables (Features): The input variables that influence the prediction, e.g.,
locality, number of rooms.
A regression problem arises when the output variable is a real or continuous value, such as "salary" or
"weight". Many different regression models can be used, but the simplest among them is linear regression.
Types of Regression
Regression can be classified into different types based on the number of predictor variables and the
nature of the relationship between variables:
1. Simple Linear Regression
o Linear regression is one of the simplest and most widely used statistical models. This
assumes that there is a linear relationship between the independent and dependent variables.
o This means that the change in the dependent variable is proportional to the change in the
independent variables. For example, predicting the price of a house based on its size.
2. Multiple Linear Regression
o Multiple linear regression extends simple linear regression by using multiple independent
variables to predict the target variable. For example, predicting the price of a house based on
multiple features such as size, location, number of rooms, etc.
3. Polynomial Regression
o Polynomial regression is used to model non-linear relationships between the dependent
variable and the independent variables.
o It adds polynomial terms to the linear regression model to capture more complex
relationships. For example, when we want to predict a non-linear trend like population growth
over time, we use polynomial regression.
4. Ridge & Lasso Regression
o Ridge & lasso regression are regularized versions of linear regression that help avoid
overfitting by penalizing large coefficients. When there is a risk of overfitting due to too many
features, we use these types of regression algorithms.
5. Support Vector Regression (SVR)
o SVR is a type of regression algorithm that is based on the Support Vector Machine
(SVM) algorithm.
o SVM is a type of algorithm that is used for classification tasks but it can also be used for
regression tasks.
o SVR works by finding a function (hyperplane) that keeps as many points as possible within a
margin of tolerance (epsilon) around it, penalizing only predictions that fall outside that margin.
6. Decision Tree Regression
o A decision tree uses a tree-like structure to make decisions, where each branch of the tree
represents a decision and the leaves represent outcomes.
o For example, when predicting customer behaviour based on features like age, income, etc., we
can use decision tree regression.
7. Random Forest Regression
o Random Forest is an ensemble method that builds multiple decision trees, and each tree is
trained on a different subset of the training data. The final prediction is made by averaging the
predictions of all of the trees. For example, forecasting sales or customer churn from historical data.
Regression Evaluation Metrics:
Evaluation in machine learning measures the performance of a model. Here are some popular
evaluation metrics for regression:
Mean Absolute Error (MAE): The average absolute difference between the predicted and actual
values of the target variable.
Mean Squared Error (MSE): The average squared difference between the predicted and actual
values of the target variable.
Root Mean Squared Error (RMSE): Square root of the mean squared error.
Huber Loss: A hybrid loss function that transitions from MAE to MSE for larger errors, providing
balance between robustness and MSE’s sensitivity to outliers.
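A minimal computation of these metrics, assuming NumPy is available; the actual and predicted values below are made up purely for illustration.

import numpy as np

y_true = np.array([3.0, 5.0, 7.5, 10.0])    # actual target values (made up)
y_pred = np.array([2.5, 5.5, 7.0, 11.0])    # model predictions (made up)

err  = y_true - y_pred
mae  = np.mean(np.abs(err))                 # Mean Absolute Error
mse  = np.mean(err ** 2)                    # Mean Squared Error
rmse = np.sqrt(mse)                         # Root Mean Squared Error

delta = 1.0                                 # Huber loss: quadratic below delta, linear beyond it
huber = np.mean(np.where(np.abs(err) <= delta,
                         0.5 * err ** 2,
                         delta * (np.abs(err) - 0.5 * delta)))
print("MAE:", mae, "MSE:", mse, "RMSE:", rmse, "Huber:", huber)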
IV. GENERATIVE AND DISCRIMINATIVE MODELS
As humans, we can adopt either of two different approaches, mirroring these two kinds of machine learning
models, when learning something such as an artificial language. These two models have not previously been
explored in human learning, but they relate to known effects of causal direction, classification vs. inference
learning, and observational vs. feedback learning.
Problem Formulation
Suppose we are working on a classification problem where our task is to decide whether an email is
spam or not spam based on the words present in that email. To solve this problem, we have a joint
model over
• Labels: Y=y, and
• Features: X = {x1, x2, …xn}
Therefore, the joint distribution of the model can be represented as
P(Y, X) = P(y, x1, x2, …, xn)
Now, our goal is to estimate the probability of spam email i.e, P(Y=1|X). Both generative and discriminative
models can solve this problem but in different ways.
Let’s see why and how they are different!
The approach of Generative Models
In the case of generative models, to find the conditional probability P(Y|X), they estimate the prior
probability P(Y) and the likelihood P(X|Y) from the training data and use the Bayes Theorem to calculate
the posterior probability P(Y|X):
P(Y|X) = P(X|Y) P(Y) / P(X)
The approach of Discriminative Models
→ In the case of discriminative models, the conditional probability P(Y|X) is estimated directly from the
training data: these models learn the decision boundary, i.e., the conditional probability, and don't make
any assumptions about how the data points are distributed. But these models are not capable of generating
new data points. Therefore, the ultimate objective of discriminative models is to separate one class from
another.
→ If we have some outliers present in the dataset, then discriminative models work better compared
to generative models, i.e., discriminative models are more robust to outliers. However, one major
drawback of these models is the misclassification problem, i.e., wrongly classifying a data point.
→ Generative models, on the other hand, tend to model the underlying patterns or distribution of the data
points. These models use the concept of joint probability and model instances where a given feature (x) or
input and the desired output or label (y) exist at the same time.
→ These models use probability estimates and likelihoods to model the data points and differentiate
between the different class labels present in a dataset. Unlike discriminative models, these models are also
capable of generating new data points.
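To make the contrast concrete, the hedged sketch below (scikit-learn assumed; the synthetic features merely stand in for email features) fits a generative model, Gaussian Naive Bayes, which estimates P(Y) and P(X|Y), and a discriminative model, logistic regression, which estimates P(Y|X) directly:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB            # generative: models P(Y) and P(X|Y)
from sklearn.linear_model import LogisticRegression   # discriminative: models P(Y|X) directly

X, y = make_classification(n_samples=1000, random_state=0)   # synthetic stand-in for email features
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

generative = GaussianNB().fit(X_train, y_train)
discriminative = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Both models end up answering the same question, P(Y=1 | X), in different ways.
print("generative     P(Y=1|x):", generative.predict_proba(X_test[:1])[0, 1])
print("discriminative P(Y=1|x):", discriminative.predict_proba(X_test[:1])[0, 1])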
V. TYPES OF LEARNING
The main types of learning in machine learning are categorized based on how the model learns from data and
the nature of the data itself. These include:
• Supervised Learning:
o Description: The model learns from labeled data, where each input example is paired with its
corresponding correct output. The goal is to learn a mapping from inputs to outputs so that the
model can predict outputs for new, unseen inputs.
o Examples: Classification (predicting categories, e.g., spam detection) and Regression
(predicting continuous values, e.g., house price prediction).
• Unsupervised Learning:
o Description: The model learns from unlabeled data, aiming to discover hidden patterns,
structures, or relationships within the data without explicit guidance.
o Examples: Clustering (grouping similar data points, e.g., customer segmentation) and
Dimensionality Reduction (reducing the number of features while retaining important
information).
• Reinforcement Learning:
o Description: An agent learns to make decisions by interacting with an environment. It receives
rewards for desirable actions and penalties for undesirable ones, aiming to maximize
cumulative reward over time.
o Examples: Game playing (e.g., AlphaGo) and Robotics (teaching robots to perform tasks).
• Semi-supervised Learning:
o Description: This approach combines aspects of both supervised and unsupervised learning. It
utilizes a small amount of labeled data along with a larger amount of unlabeled data to train a
model.
o Examples: Text classification with limited labeled documents, image recognition.
• Self-supervised Learning:
o Description: A subset of unsupervised learning where the model generates its own labels from
the input data itself, effectively creating a supervised learning task from unlabeled data.
o Examples: Pre-training large language models (like BERT or GPT) by predicting masked
words or next sentences.
• Deep Learning:
• Concept:
A subfield of machine learning that utilizes artificial neural networks with multiple layers (deep neural
networks) to learn complex patterns from large datasets. Deep learning models can be applied to
supervised, unsupervised, and reinforcement learning tasks.
• Examples:
o Convolutional Neural Networks (CNNs): Primarily used for image and video analysis.
o Recurrent Neural Networks (RNNs) and Transformers: Primarily used for sequential data
like natural language processing (NLP) and time series.
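A brief sketch contrasting supervised and unsupervised learning (scikit-learn assumed; the blob data is synthetic and only for illustration):

from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X, y = make_blobs(n_samples=300, centers=3, random_state=0)   # synthetic data forming 3 groups

# Supervised learning: the labels y are used during training.
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Unsupervised learning: only X is used; the algorithm discovers the grouping itself.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print("supervised prediction for first point:", clf.predict(X[:1])[0])
print("unsupervised cluster for first point :", km.labels_[0])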
Bayes Theorem
→ Bayesian decision theory refers to the statistical approach based on trade-off quantification among
various classification decisions based on the concept of Probability(Bayes Theorem) and the costs
associated with the decision.
→ It is basically a classification technique that involves the use of the Bayes Theorem which is used to
find the conditional probabilities.
→ The Bayes theorem describes the probability of an event based on the prior knowledge of the conditions
that might be related to the event. The conditional probability of A given B, represented by P(A | B) is
the chance of occurrence of A given that B has occurred.
P(A | B) = P(A, B) / P(B)
By using the chain rule, this can also be written as:
P(A, B) = P(A|B) P(B) = P(B|A) P(A), so
P(A | B) = P(B|A) P(A) / P(B) —— (1)
where
P(B) = P(B, A) + P(B, A′) = P(B|A) P(A) + P(B|A′) P(A′)
Here, equation (1) is known as the Bayes Theorem of probability.
Our aim is to explore each of the components included in this theorem. Let’s explore step by step:
a) Prior or State of Nature:
→ Prior probabilities represent how likely each class is to occur.
→ Priors are known before the training process.
→ The state of nature is treated as a random variable, and P(wi) is the prior probability of class wi.
→ If there are only two classes, the priors sum to one: P(w1) + P(w2) = 1, provided the classes are
exhaustive.
b) Class Conditional Probabilities:
→ It represents the probability of how likely a feature x occurs given that it belongs to a particular class.
It is denoted by P(x | wi), where x is a particular feature.
→ In other words, it is the probability of how likely the feature x occurs given that it belongs to the class wi.
→ Sometimes it is also known as the Likelihood.
→ It is the quantity that we estimate during training. During the training process, we have
input (features) X labeled with the corresponding class w, and we figure out the likelihood of occurrence of
that set of features given the class label.
Evidence:
→ It is the probability of occurrence of a particular feature, i.e., P(X).
→ It can be calculated using the law of total probability as P(X) = Σi P(X | wi) P(wi), summing over all
classes wi.
→ Since the evidence is built from the class-conditional likelihoods and the priors, it is also obtained
during training.
Posterior Probabilities:
→ It is the probability of occurrence of class wi given that certain features X are observed.
→ It is what we aim to compute in the test phase, in which we have the testing input/features (the given entity)
and have to find how likely the trained model predicts that the features belong to the particular class wi.
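A small numeric sketch of these quantities; the priors and likelihoods below are made-up numbers used only to show how the pieces combine.

# Two classes w1 and w2 with assumed priors and class-conditional likelihoods for a feature x.
priors = [0.6, 0.4]          # P(w1), P(w2)        (made up)
likelihoods = [0.2, 0.5]     # P(x|w1), P(x|w2)    (made up)

evidence = sum(l * p for l, p in zip(likelihoods, priors))            # P(x) = sum_i P(x|wi) P(wi)
posteriors = [l * p / evidence for l, p in zip(likelihoods, priors)]  # P(wi|x) by Bayes theorem

print("evidence P(x) =", evidence)            # 0.2*0.6 + 0.5*0.4 = 0.32
print("posteriors    =", posteriors)          # [0.375, 0.625] -> decide class w2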
Geometric Vector
A geometric vector is a directed arrow with a magnitude and a direction; this is a vector, but not the only
kind of vector discussed in Linear Algebra for Machine Learning.
What we had above is also a vector, just of another kind. You might be more familiar with the matrix form:
a vector written as a matrix with only one column is known as a column vector. In other words, we can think
of a matrix as a group of column vectors or row vectors. In summary, vectors are special objects that can be
added together and multiplied by scalars to produce another object of the same kind, and many different
kinds of objects can be treated as vectors.
Matrix
→ Linear algebra itself is a systematic representation of data that computers can understand, and all the
operations in linear algebra are systematic rules. That is why, in modern machine learning, linear algebra
is an important subject of study.
→ An example of how linear algebra is used is the linear equation. Linear algebra is a tool used for linear
equations because so many problems can be presented systematically in a linear way. A typical linear
equation is presented in the form a1x1 + a2x2 + … + anxn = b.
Linear Equation
To solve the linear equation problem above, we use linear algebra to present the linear equations in a
systematic matrix representation. This way, we can use matrix properties to look for the optimal
solution, as sketched below.
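A small illustration of this (NumPy assumed; the system of equations is made up): the two equations are written in matrix form Ax = b and solved with a standard routine.

import numpy as np

# Two linear equations written as Ax = b:
#   2x + 3y = 8
#   1x - 1y = -1
A = np.array([[2.0,  3.0],
              [1.0, -1.0]])
b = np.array([8.0, -1.0])

solution = np.linalg.solve(A, b)   # uses the matrix representation to find x and y
print(solution)                    # [1. 2.]  ->  x = 1, y = 2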
Cartesian Coordinate
Projecting a dataset onto the plane is one example of how we acquire information from data points. How
we acquire information from this kind of representation is the heart of Analytical Geometry. To help you
start learning this subject, here are some important terms you might need.
Distance Function
A distance function is a function that provides numerical information about the distance between the elements
of a set. If the distance is zero, the elements are equivalent; otherwise, they are different from each other.
An example of a distance function is the Euclidean distance, which calculates the straight-line distance
between two data points.
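A minimal Euclidean distance sketch (NumPy assumed; the points are arbitrary):

import numpy as np

def euclidean_distance(p, q):
    # Straight-line (linear) distance between two points.
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sqrt(np.sum((p - q) ** 2))

print(euclidean_distance([0, 0], [3, 4]))   # 5.0
print(euclidean_distance([1, 2], [1, 2]))   # 0.0 -> the two elements are equivalent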
→ The concept of a hypothesis is fundamental in Machine Learning and data science endeavours. In the
realm of machine learning, a hypothesis serves as an initial assumption made by data scientists and
ML professionals when attempting to address a problem. Machine learning involves conducting
experiments based on past experiences, and these hypotheses are crucial in formulating potential
solutions.
Hypothesis in Machine Learning
A hypothesis in machine learning is the model's presumption regarding the connection between
the input features and the result. It is an illustration of the mapping function that the algorithm is attempting
to discover using the training set. To minimize the discrepancy between the expected and actual outputs, the
learning process involves modifying the weights that parameterize the hypothesis. The objective is to optimize
the model's parameters to achieve the best predictive performance on new, unseen data, and a cost function is
used to assess the hypothesis' accuracy.
How does a Hypothesis work?
In most supervised machine learning algorithms, our main goal is to find a possible hypothesis from
the hypothesis space that could map out the inputs to the proper outputs. The following figure shows the
common method to find out the possible hypothesis from the Hypothesis space:
Hypothesis Space (H)
Hypothesis space is the set of all possible legal hypotheses. This is the set from which the machine
learning algorithm determines the single best hypothesis that would best describe the target function
or the outputs.
Hypothesis (h)
A hypothesis is a function that best describes the target in supervised machine learning. The hypothesis
that an algorithm comes up with depends on the data and also on the restrictions and bias that we have
imposed on the data.
To better understand the Hypothesis Space and Hypothesis consider the following coordinate that
shows the distribution of some data:
Suppose we have test data for which we have to determine the outputs or results. The test data is as shown
below:
But note here that we could have divided the coordinate plane as:
→ The way in which the coordinate would be divided depends on the data, algorithm and
constraints.
→ All the legal possible ways in which we can divide the coordinate plane to predict the
outcome of the test data together compose the Hypothesis Space.
→ Each individual possible way is known as the hypothesis.
→ Hence, in this example the hypothesis space would be like:
Hypothesis Evaluation:
The process of machine learning involves not only formulating hypotheses but also evaluating their
performance. This evaluation is typically done using a loss function or an evaluation metric that quantifies the
disparity between predicted outputs and ground truth labels. Common evaluation metrics include mean
squared error (MSE), accuracy, precision, recall, F1-score, and others. By comparing the predictions of the
hypothesis with the actual outcomes on a validation or test dataset, one can assess the effectiveness of the
model.
Hypothesis Testing and Generalization:
Once a hypothesis is formulated and evaluated, the next step is to test its generalization capabilities.
Generalization refers to the ability of a model to make accurate predictions on unseen data. A hypothesis that
performs well on the training dataset but fails to generalize to new instances is said to suffer from overfitting.
Conversely, a hypothesis that generalizes well to unseen data is deemed robust and reliable.
The process of hypothesis formulation, evaluation, testing, and generalization is often iterative in nature. It
involves refining the hypothesis based on insights gained from model performance, feature importance, and
domain knowledge. Techniques such as hyperparameter tuning, feature engineering, and model selection play
a crucial role in this iterative refinement process.
Hypothesis in Statistics
In statistics, a hypothesis refers to a statement or assumption about a population parameter. It is a
proposition or educated guess that helps guide statistical analyses. There are two types of hypotheses: the null
hypothesis (H0) and the alternative hypothesis (H1 or Ha).
Null Hypothesis(H0): This hypothesis suggests that there is no significant difference or effect, and any
observed results are due to chance. It often represents the status quo or a baseline assumption.
Alternative Hypothesis (H1 or Ha): This hypothesis contradicts the null hypothesis, proposing that there is a
significant difference or effect in the population. It is what researchers aim to support with evidence.
Inductive Bias
→ Inductive bias refers to the set of assumptions a learning algorithm uses to generalize from the training
data to unseen examples. The choice of inductive bias can impact the interpretability of the model: simpler
biases may lead to more interpretable models, while more complex biases may sacrifice interpretability for
improved performance.
Inductive bias is a fundamental concept in machine learning that shapes how algorithms learn and
generalize from data. It serves as a guiding principle that influences the selection of hypotheses and the
generalization of models to unseen data. Understanding the inductive bias of an algorithm is essential for
model development, selection, and interpretation, as it provides insights into how the algorithm is learning
and making predictions. By carefully considering and balancing inductive bias, machine learning practitioners
can develop models that generalize well and provide valuable insights into complex datasets.
X. EVALUATION
→ Evaluation in machine learning is the process of assessing the performance of a trained model or
hypothesis. This is crucial for understanding how well the model generalizes to new, unseen data and
for comparing different models.
→ Evaluation typically involves:
Splitting Data: Dividing the available dataset into training, validation, and test sets. The model
is trained on the training set, hyper-parameters are tuned using the validation set, and the final
performance is measured on the unseen test set.
Metrics: Using appropriate metrics to quantify performance.
Example: For classification, metrics like accuracy, precision, recall, F1-score, or AUC-ROC are
used. For regression, metrics like Mean Squared Error (MSE), Root Mean Squared Error (RMSE), or R-
squared are common.
Cross-Validation: Employing techniques like k-fold cross-validation to get a more robust estimate of the
model's performance, especially when data is limited.
Example: After training a classification model, we can evaluate its performance by calculating its
accuracy on a held-out test set. If the accuracy is 85%, it indicates that the model correctly classifies 85% of
the examples in the test set.
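A hedged sketch of that evaluation workflow (scikit-learn assumed; the built-in dataset and the classifier are placeholders):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)                            # placeholder dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)  # train on the training set
accuracy = accuracy_score(y_test, model.predict(X_test))              # evaluate on unseen test data
print("test accuracy:", accuracy)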
XI. TRAINING AND TEST SETS, CROSS VALIDATION, CONCEPT OF OVER FITTING, UNDER
FITTING, BIAS AND VARIANCE.
1. Training and Test Sets
Training Set:
This is the portion of the dataset used to train the machine learning model. The model learns patterns
and relationships from this data.
Test Set:
This is the unseen portion of the dataset used to evaluate the model's performance on new, unobserved
data. It assesses how well the model generalizes.
Example: Imagine training a model to predict house prices. The training set would contain features (size,
location, number of rooms) and corresponding prices for houses the model learns from. The test set would
contain similar features for houses the model hasn't seen, and its predictions would be compared to the actual
prices to gauge accuracy.
2. Cross-Validation
Cross-validation is a technique to estimate a model's performance and robustness more reliably than a
single train-test split. It involves partitioning the data into multiple subsets (folds).
K-Fold Cross-Validation: The dataset is divided into 'k' equal-sized folds. The model is trained 'k' times,
each time using 'k-1' folds for training and one fold for testing. The results are then averaged.
Example: For a 5-fold cross-validation, the data is split into 5 parts. In the first iteration, folds 2-5 are
used for training, and fold 1 for testing. In the second, folds 1, 3-5 train, and fold 2 tests, and so on.
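A minimal 5-fold cross-validation sketch (scikit-learn assumed; the dataset and model are placeholders):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)               # placeholder dataset
model = LogisticRegression(max_iter=1000)

scores = cross_val_score(model, X, y, cv=5)     # 5 folds: train on 4, test on the remaining 1, rotate
print("fold accuracies:", scores)
print("mean accuracy  :", scores.mean())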
3. Overfitting and Underfitting
Overfitting occurs when a model learns the training data too well, including its noise and outliers,
leading to poor performance on new, unseen data.
Example: A model trained to classify images of cats and dogs overfits if it perfectly identifies the cats
and dogs in the training set but fails to recognize new cat and dog images because it memorized specific
features of the training images instead of learning general characteristics.
Underfitting occurs when a model is too simple to capture the underlying patterns in the training data,
resulting in poor performance on both training and test data.
Example: Using a simple linear regression model to predict a non-linear relationship (e.g., a curved
trend) between two variables would likely underfit, as the model cannot capture the complexity of the data.
→ Overfitting and underfitting are the two main problems that occur in machine learning and degrade the
performance of machine learning models.
→ The main goal of each machine learning model is to generalize well. Here, generalization
defines the ability of an ML model to provide suitable output for a given set of unseen inputs. It means
that after being trained on the dataset, the model can produce reliable and accurate output on new data.
→ Hence, underfitting and overfitting are the two terms that need to be checked to judge the performance
of the model and whether the model is generalizing well or not.
Before understanding overfitting and underfitting, let's understand some basic terms that will help in
understanding this topic well:
Signal: It refers to the true underlying pattern of the data that helps the machine learning model to
learn from the data.
Noise: Noise is unnecessary and irrelevant data that reduces the performance of the model.
Bias: Bias is a prediction error introduced in the model due to oversimplifying the machine
learning algorithm; it shows up as a systematic difference between the predicted values and the actual values.
Variance: If the machine learning model performs well with the training dataset but does not perform
well with the test dataset, the model is said to have high variance.
Overfitting
→ Overfitting occurs when our machine learning model tries to cover all the data points, or more data points
than required, in the given dataset. Because of this, the model starts memorizing the noise and
inaccurate values present in the dataset, and all these factors reduce the efficiency and accuracy of the
model. An overfitted model has low bias and high variance.
→ The chance of overfitting increases the more we train our model on the same data: the longer we train,
the more likely we are to end up with an overfitted model.
→ Overfitting is the main problem that occurs in supervised learning.
Example: The concept of overfitting can be understood from the graph of a linear regression output below:
As we can see from such a graph, the model tries to cover all the data points present in the scatter
plot. It may look efficient, but in reality it is not, because the goal of the regression model is to find the
best-fit trend; here no general best fit is obtained, so the model will generate prediction errors on new data.
How to avoid the Overfitting in Model
Both overfitting and underfitting degrade the performance of a machine learning model, but
overfitting is the more common problem, so there are some ways by which we can reduce the occurrence of
overfitting in our model.
→ Cross−Validation
→ Training with more data
→ Removing features
→ Early stopping the training
→ Regularization
→ Ensembling
Underfitting
→ Underfitting occurs when our machine learning model is not able to capture the underlying trend of
the data.
→ To avoid overfitting, the feeding of training data can be stopped at an early stage, but then
the model may not learn enough from the training data. As a result, it may fail to find the best
fit of the dominant trend in the data.
→ In the case of underfitting, the model is not able to learn enough from the training data, and hence its
accuracy is reduced and it produces unreliable predictions. An underfitted model has high bias and low
variance.
Example: We can understand underfitting using the output of the linear regression model below:
As we can see from such a diagram, the model is unable to capture the trend of the data points present in the plot.
How to avoid underfitting:
→ By increasing the training time of the model.
→ By increasing the number of features.
Goodness of Fit
→ The term "goodness of fit" is taken from statistics, and the goal of machine learning models is to
achieve a good fit. In statistical modeling, it defines how closely the results or predicted values
match the true values of the dataset.
→ The model with a good fit lies between the underfitted and overfitted models; ideally it makes
predictions with zero error, but in practice this is difficult to achieve.
→ When we train our model for a while, the errors on the training data go down, and the same happens
with the test data. But if we train the model for too long, the performance of the model may
decrease due to overfitting, as the model also learns the noise present in the dataset.
→ The errors on the test dataset then start increasing, so the point just before the errors start rising is the
sweet spot, and we can stop there to achieve a good model. There are two other methods by which we can
find a good point for our model: resampling methods to estimate model accuracy, and a
validation dataset.
Cross validation:
→ Cross-validation is a technique for validating the model's efficiency by training it on a subset of the input
data and testing it on a previously unseen subset of the input data. We can also say that it is a technique
to check how a statistical model generalizes to an independent dataset.
→ In machine learning, there is always a need to test the stability of the model; we cannot judge the model
based only on the training dataset. For this purpose, we reserve a particular sample of the dataset which was
not part of the training dataset. After that, we test our model on that sample before deployment, and this
complete process comes under cross-validation. This is something different from the general train-test split.
→ Hence the basic steps of cross−validations are:
o Reserve a subset of the dataset as a validation set.
o Provide the training to the model using the training dataset.
o Now, evaluate model performance using the validation set. If the model performs well with the
validation set, perform the further step, else check for the issues.
Methods used for Cross-Validation
There are some common methods that are used for cross−validation. These methods are given below:
→ Validation Set Approach
→ Leave−P−out cross−validation
→ Leave one out cross−validation
→ K−fold cross−validation
→ Stratified k−fold cross−validation
Validation Set Approach
→ In the validation set approach, we divide our input dataset into a training set and a test (validation) set.
Each subset is given 50% of the dataset.
→ A big disadvantage is that we are using only 50% of the dataset to train our model, so the
model may fail to capture important information in the data. It also tends to give an underfitted
model.
Leave-P-out cross-validation
→ In this approach, p data points are left out of the training data. That is, if there are n data points in
the original input dataset, then n−p data points are used as the training set and the p data points as
the validation set. This complete process is repeated for all possible combinations, and the average error is
calculated to assess the effectiveness of the model.
→ The disadvantage of this technique is that it can be computationally expensive for large p, as sketched below.
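To see why, the sketch below counts the number of train/validation splits Leave-P-out generates compared with k-fold (scikit-learn assumed; the tiny dataset is made up):

import numpy as np
from sklearn.model_selection import LeavePOut, KFold

X = np.arange(20).reshape(10, 2)      # 10 made-up data points

lpo = LeavePOut(p=2)                  # every possible pair of points is held out once
kf = KFold(n_splits=5)                # only 5 folds

print("Leave-2-out splits:", lpo.get_n_splits(X))   # C(10, 2) = 45 model fits
print("5-fold splits     :", kf.get_n_splits(X))    # 5 model fits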
XII. REGRESSION:
1. LINEAR REGRESSION:
→ Linear regression is one of the easiest and most popular Machine Learning algorithms. It is a statistical
method that is used for predictive analysis. Linear regression makes predictions for continuous/real or
numeric variables such as sales, salary, age, product price, etc.
→ The linear regression algorithm shows a linear relationship between a dependent (y) variable and one or
more independent (x) variables, hence it is called linear regression. Since linear regression shows a linear
relationship, it finds how the value of the dependent variable changes according to the value of the
independent variable.
The linear regression model provides a sloped straight line representing the relationship
between the variables. Consider the below image:
Cost function
→ Different values for the weights or coefficients of the line (a0, a1) give different regression lines,
and the cost function is used to estimate the values of the coefficients for the best-fit line.
→ Cost function optimizes the regression coefficients or weights. It measures how a linear regression
model is performing.
→ We can use the cost function to find the accuracy of the mapping function, which maps the input
variable to the output variable. This mapping function is also known as Hypothesis function.
→ For Linear Regression, we use the Mean Squared Error (MSE) cost function, which is the average
of the squared errors between the predicted values and the actual values.
→ For the linear equation y = a1x + a0, MSE can be written as:
MSE = (1/N) Σi (Yi − (a1xi + a0))²
where,
N = total number of observations
Yi = actual value
(a1xi + a0) = predicted value.
Residuals: The distance between an actual value and the corresponding predicted value is called the residual.
If the observed points are far from the regression line, the residuals will be large and so the cost function will
be high. If the scatter points are close to the regression line, the residuals will be small and hence so will the
cost function.
Gradient Descent:
→ Gradient descent is used to minimize the MSE by calculating the gradient of the cost function.
→ A regression model uses gradient descent to update the coefficients of the line by reducing the cost
function.
→ This is done by starting from randomly selected coefficient values and then iteratively updating them to
reach the minimum of the cost function, as sketched below.
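A minimal gradient descent sketch for the line y = a1*x + a0, minimizing MSE (NumPy assumed; the data points and the learning rate are made up for illustration):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.1, 10.8])     # roughly y = 2x + 1 with noise (made up)

a0, a1 = 0.0, 0.0                            # start from arbitrary coefficient values
lr = 0.01                                    # learning rate

for _ in range(5000):
    error = (a1 * x + a0) - y                # predicted minus actual
    grad_a1 = 2 * np.mean(error * x)         # d(MSE)/d(a1)
    grad_a0 = 2 * np.mean(error)             # d(MSE)/d(a0)
    a1 -= lr * grad_a1                       # step against the gradient
    a0 -= lr * grad_a0

print("slope a1 =", a1, " intercept a0 =", a0)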
Model Performance:
The goodness of fit determines how well the regression line fits the set of observations. The process of
finding the best model out of various models is called optimization. It can be assessed with the method below:
1. R-squared method:
→ R−squared is a statistical method that determines the goodness of fit.
→ It measures the strength of the relationship between the dependent and independent variables on a
scale of 0−100%.
→ A high value of R-squared indicates a small difference between the predicted values and the actual
values and hence represents a good model.
→ It is also called the coefficient of determination, or the coefficient of multiple
determination for multiple regression.
→ It can be calculated from the formula below:
R² = Explained variation / Total variation = 1 − Σi (Yi − Ŷi)² / Σi (Yi − Ȳ)²
Notice that the best-fit line lies as close as possible to all the scattered data points. This is what an ideal
best-fit line looks like. To better understand the whole process, let's see how to calculate the line using Least
Squares Regression.
Surely you've come across the equation y = mx + c before. It is a simple equation that represents a straight
line in 2-dimensional data, i.e., along the x-axis and y-axis. To better understand it, let's break down the equation:
→ y: dependent variable
→ m: the slope of the line
→ x: independent variable
→ c: y−intercept
So, the aim is to calculate the values of the slope and y-intercept and substitute the corresponding 'x' values
into the equation in order to derive the value of the dependent variable.
Let’s see how this can be done.
As an assumption, let’s consider that there are ‘n’ data points.
Step 1: Calculate the slope 'm' using the following formula:
m = (n Σ(xy) − Σx Σy) / (n Σ(x²) − (Σx)²)
Step 2: Compute the y-intercept c (the value of y at the point where the line crosses the y-axis):
c = ȳ − m x̄
Step 3: Substitute the values of m and c into the final equation y = mx + c.
Now let’s look at an example and see how you can use the least−squares regression method to compute the
line of best fit.
Least Squares Regression Example
Consider an example: Tom, the owner of a retail shop, recorded the price of different T-shirts versus
the number of T-shirts sold at his shop over a period of one week. He tabulated this as shown below:
Let us use the concept of least squares regression to find the line of best fit for the above data.
Step 1: Calculate the slope 'm' using the formula m = (n Σ(xy) − Σx Σy) / (n Σ(x²) − (Σx)²). Once you
substitute the tabulated values, this works out to roughly m = 1.518.
Step 2: Compute the y-intercept using c = ȳ − m x̄, which works out to roughly c = 0.305.
Let's construct a graph that represents the y = mx + c line of best fit:
Now Tom can use the above equation to estimate how many T-shirts priced at $8 he can sell at the retail shop:
y = 1.518 × 8 + 0.305 = 12.45 T-shirts
This comes down to about 13 T-shirts! That's how simple it is to make predictions using Linear
Regression.
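The same computation can be sketched for any small table of values (NumPy assumed; the numbers below are made up for illustration); it applies the slope and intercept formulas from Steps 1 and 2 directly.

import numpy as np

x = np.array([2.0, 3.0, 5.0, 7.0, 9.0])      # e.g., T-shirt prices (made up)
y = np.array([4.0, 5.0, 7.0, 10.0, 15.0])    # e.g., number of T-shirts sold (made up)

n = len(x)
m = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x ** 2) - np.sum(x) ** 2)
c = np.mean(y) - m * np.mean(x)              # c = mean(y) - m * mean(x)

print("slope m =", m, " intercept c =", c)
print("predicted sales at price 8:", m * 8 + c)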
3. LASSO REGRESSION
Whenever we hear the term "regression," two things come to mind: linear regression and
logistic regression. Even though logistic regression falls under the classification category, it is still one of
the first things that comes to mind.
These two topics are quite famous and are the basic introduction topics in Machine Learning. There
are other types of regression, like
→ Lasso regression,
→ Ridge regression,
→ Polynomial regression,
→ Stepwise regression,
→ ElasticNet regression
The above-mentioned techniques are mostly used in regression-type analytical problems. When we
increase the degrees of freedom of regression models (by adding polynomial terms to the equation), they tend
to overfit. Using regularization techniques, we can overcome the overfitting issue.
Two popular methods for this are lasso regression and ridge regression; here we focus on lasso.
What Is Regression?
Regression is a statistical technique used to determine the relationship between one dependent variable
and one or many independent variables. In simple words, a regression analysis will tell you how your result
varies for different factors.
For example,
What determines a person's salary?
Many factors, like educational qualification, experience, skills, job role, company, etc., play a role in salary.
You can use regression analysis to predict the dependent variable – salary using the mentioned factors.
y = mx+c
Do you remember this equation from our school days?
It is nothing but the linear regression equation. In the above equation, the dependent variable is estimated
from the independent variable.
In mathematical terms,
→ Y is the dependent value,
→ X is the independent value,
→ m is the slope of the line,
→ c is the constant value.
The same equation terms are named slightly differently in machine learning and in the statistical world.
To create the best-fit line from the actual values, the regression model will iterate and recalculate the
m (coefficient) and c (bias) values while trying to reduce the loss value with a proper loss function.
If the model fits the training data too closely, it will have low bias and high variance due to overfitting: the
fit is good on the training data, but it will not give good predictions on test data. Regularization comes into
play to tackle this issue.
What Is Regularization?
Regularization solves the problem of overfitting, which causes low model accuracy on new data. Overfitting
happens when the model learns the data as well as the noise in the training set.
Noise consists of random data points in the training set which don't represent the actual properties of the data.
Y ≈ C0 + C1X1 + C2X2 + …+ CpXp
Y represents the dependent variable, X represents the independent variables and C represents the
coefficient estimates for different variables in the above linear regression equation.
Model fitting involves a loss function known as the residual sum of squares. The coefficients in the equation
are chosen so as to reduce this loss function to a minimum value. Misleading coefficients get selected if there
is a lot of irrelevant data or noise in the training set.
Definition Of Lasso Regression
→ Lasso regression is like linear regression, but it uses a "shrinkage" technique in which the regression
coefficients are shrunk towards zero. Plain linear regression gives you the regression coefficients exactly as
estimated from the dataset.
→ The lasso regression allows you to shrink or regularize these coefficients to avoid overfitting and make
them work better on different datasets.
→ This type of regression is used when the dataset shows high multicollinearity or when you want to
automate variable elimination and feature selection.
When To Use Lasso Regression?
Choosing a model depends on the dataset and the problem statement you are dealing with. It is essential
to understand the dataset and how features interact with each other.
Lasso regression penalizes less important features of your dataset and makes their respective
coefficients zero, thereby eliminating them. Thus it provides you with the benefit of feature selection and
simple model creation. So, if the dataset has high dimensionality and high correlation, lasso regression can be
used.
The Statistics of Lasso Regression
Lasso adds an L1 penalty to the least-squares loss: the coefficients are chosen to minimize
Σi (Yi − Ŷi)² + λ Σj |Cj|
where λ is the regularization strength that controls how strongly the coefficients are shrunk; a large enough λ
drives some coefficients exactly to zero.
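A hedged Lasso sketch (scikit-learn assumed; the synthetic data and the regularization strength alpha are placeholders). Note how some coefficients are driven exactly to zero, which is the feature-selection behaviour described above.

from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic data: only 5 of the 20 features actually carry information.
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)     # alpha controls the strength of the L1 penalty

nonzero = sum(1 for c in lasso.coef_ if c != 0)
print("non-zero coefficients      :", nonzero)
print("coefficients shrunk to zero:", 20 - nonzero)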
4. DECISION TREES
Tree based methods – Decision Trees
→ Tree-based machine learning methods are among the most commonly used supervised learning methods.
They are constructed from two entities: branches and nodes.
→ Tree-based ML methods are built by recursively splitting a training sample, using at each node the feature
from the dataset that splits the data most effectively.
→ The splitting is based on simple decision rules inferred from the training data. Generally,
tree-based ML methods are simple and intuitive: to predict a class label or value, we start from the top
of the tree (the root) and follow the branches down through the nodes, comparing feature values at the
splits chosen during training.
→ Tree−based methods also use the mean for continuous variables or mode for categorical variables when
making predictions on training observations in the regions they belong to. Since the set of rules used to
segment the predictor space can be summarized in a visual representation with branches that show all
the possible outcomes, these approaches are commonly referred to as decision tree methods.
→ The methods are flexible and can be applied to either classification or regression problems.
Classification and Regression Trees (CART) is a term introduced by Leo Breiman, referring to
the flexibility of these methods in solving both linear and non-linear predictive modeling problems.
Types of Decision Trees
Decision trees can be classified based on the type of target or response variable.
i. Classification Trees
The default type of decision tree, used when the response variable is categorical, e.g., predicting
whether a team will win or lose a game.
ii. Regression Trees
Used when the target variable is continuous or numerical in nature, e.g., predicting house prices
based on year of construction, number of rooms, etc.
Advantages of Tree-based Machine Learning Methods
1. Interpretability: Decision tree methods are easy to understand, even for non-technical people.
2. The data type isn't a constraint, as the methods can handle both categorical and numerical variables.
3. Data exploration: decision trees help us easily identify the most significant variables and their
relationships.
Disadvantages of Tree-based Machine Learning Methods
1. Large decision trees are complex, time−consuming and less accurate in predicting outcomes.
2. Decision trees don’t fit well for continuous variables, as they lose important information when
segmenting the data into different regions.
i) Root node — this represents the entire population or the sample, which gets divided into two or more
homogenous subsets.
ii) Splitting — subdividing a node into two or more sub−nodes.
iii) Decision node — this is when a sub−node is divided into further sub−nodes.
iv) Leaf/Terminal node — this is the final/last node that we consider for our model output. It cannot be split
further.
v) Pruning — removing unnecessary sub−nodes of a decision node to combat overfitting.
vi) Branch/Sub-tree — the sub−section of the entire tree.
vii) Parent and Child node — a node that’s subdivided into a sub−node is a parent, while the sub−node is
the child node.
→ Further, the resulting subsets are split again using the same logic. This continues until pure subsets are
found in the tree or the maximum possible number of leaves in the growing tree is reached.
The CART algorithm works via the following process:
→ The best split point of each input is obtained.
→ Based on the best split points of each input in Step 1, the new “best” split point is identified.
→ Split the chosen input according to the “best” split point.
→ Continue splitting until a stopping rule is satisfied or no further desirable splitting is available.
The CART algorithm uses Gini impurity to split the dataset into a decision tree. It does that by searching
for the split that yields the most homogeneous sub-nodes, with the help of the Gini index criterion.
Gini index/Gini impurity
→ The Gini index is a metric for the classification tasks in CART.
→ It is based on the sum of the squared probabilities of each class.
→ It measures the probability that a specific instance would be wrongly classified if it were labelled
randomly according to the class distribution; it is a variation of the Gini coefficient.
→ It works on categorical variables, produces outcomes of either "success" or "failure", and hence
performs binary splits only.
The value of the Gini index varies from 0 to 1:
→ 0 indicates that all the elements belong to a single class (only one class exists in the node),
→ 1 signifies that the elements are randomly distributed across various classes, and
→ 0.5 denotes that the elements are uniformly distributed over two classes.
Mathematically, we can write the Gini impurity as follows:
Gini = 1 − Σi (pi)²
where pi is the proportion of observations in the node that belong to class i.
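A small sketch of the Gini impurity computation in plain Python (the label counts are made up):

def gini_impurity(labels):
    # Gini = 1 - sum(p_i^2) over the class proportions in a node.
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

print(gini_impurity(["yes"] * 10))                 # 0.0  -> pure node, only one class
print(gini_impurity(["yes"] * 5 + ["no"] * 5))     # 0.5  -> evenly mixed binary node
print(gini_impurity(["yes"] * 8 + ["no"] * 2))     # 0.32 -> mostly one class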
Classification tree
A classification tree is an algorithm where the target variable is categorical. The algorithm is then used
to identify the “Class” within which the target variable is most likely to fall. Classification trees are used when
the dataset needs to be split into classes that belong to the response variable (like yes or no).
Regression tree
A Regression tree is an algorithm where the target variable is continuous and the tree is used to predict
its value. Regression trees are used when the response variable is continuous. For example, if the response
variable is the temperature of the day.
CART model representation
CART models are formed by picking input variables and evaluating split points on those variables until an
appropriate tree is produced.
Steps to create a Decision Tree using the CART algorithm:
→ Greedy algorithm: The input space is divided using a greedy method known as recursive binary
splitting. This is a numerical procedure in which all the values are lined up and several candidate split
points are tried and assessed using a cost function.
→ Stopping criterion: As it works its way down the tree with the training data, the recursive binary
splitting method described above must know when to stop splitting. The most frequent halting method
is to require a minimum number of training instances allocated to every leaf node. If the count is smaller
than the specified threshold, the split is rejected and the node is taken as a final leaf node.
→ Tree pruning: A decision tree's complexity is defined by the number of splits in the tree. Trees with
fewer branches are recommended as they are simpler to grasp and less prone to overfitting the data. Working
through each leaf node in the tree and evaluating the effect of deleting it using a hold-out test set is the
quickest and simplest pruning approach.
→ Data preparation for the CART: No special data preparation is required for the CART algorithm.
Advantages of CART
→ Results are simple to interpret.
→ Classification and regression trees are nonparametric and can capture nonlinear relationships.
→ Classification and regression trees implicitly perform feature selection.
→ Outliers have no meaningful effect on CART.
→ It requires minimal supervision and produces easy-to-understand models.
Limitations of CART
→ Overfitting.
→ High variance.
→ Low bias.
→ The tree structure may be unstable.
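As a closing sketch, the code below fits a CART-style classification tree with scikit-learn (assumed to be installed; the built-in dataset is a placeholder). DecisionTreeClassifier uses the Gini criterion by default, and limiting max_depth is one simple way to curb the overfitting and variance issues listed above.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)                          # placeholder dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
tree.fit(X_train, y_train)                                 # recursive binary splitting on Gini impurity

print("test accuracy:", tree.score(X_test, y_test))
print(export_text(tree))                                   # textual view of the learned splits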