Unit-3
Linear Regression
Linear regression analysis is used to predict the value of a variable based on the value of another
variable. The variable you want to predict is called the dependent variable. The variable you are
using to predict the other variable's value is called the independent variable. Linear regression
makes predictions for continuous/real or numeric variables such as sales, salary, age, product
price, etc.
This form of analysis estimates the coefficients of the linear equation, involving one or more
independent variables that best predict the value of the dependent variable. Linear regression fits
a straight line or surface that minimizes the discrepancies between predicted and actual output
values.
When there is only one independent feature, it is known as Simple Linear Regression, and when
there is more than one feature, it is known as Multiple Linear Regression.
Similarly, when there is only one dependent variable, it is considered Univariate Linear
Regression, while when there is more than one dependent variable, it is known as Multivariate
Regression.
The equation of linear regression is y = mx + b, where:
• m is the slope of the line,
• x is the independent variable,
• b is the intercept,
• y is the dependent variable.
Now let us predict the prices of pizzas.
Project: Predicting Pizza Prices
1. Data Collection
2. Calculations
3. Predictions
4. Visualizations
Table for predicting pizza prices (mean of X = 10, mean of Y = 13):

Diameter X (inches)   Price Y   X - mean(X)   Y - mean(Y)   Product of deviations   (X - mean(X))^2
8 (small)             10        -2            -3            6                       4
10 (medium)           13         0             0            0                       0
12 (large)            16         2             3            6                       4
Sum                                                         12                      8
The slope of the line is:
m = (sum of products of deviations) / (sum of squared deviations of X)
The formula for calculating the value of b:
b = (mean of Y) - m × (mean of X)
For the pizza data, m = 12 / 8 = 1.5 and b = 13 - 1.5 × 10 = -2, so the prediction equation is:
y = 1.5x - 2
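The same calculation can be written as a short Python sketch (a minimal illustration of the formulas above, not a library implementation):

xs = [8, 10, 12]    # pizza diameter (X)
ys = [10, 13, 16]   # pizza price (Y)

mean_x = sum(xs) / len(xs)   # 10
mean_y = sum(ys) / len(ys)   # 13

# m = (sum of products of deviations) / (sum of squared deviations of X)
m = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
    / sum((x - mean_x) ** 2 for x in xs)      # 12 / 8 = 1.5
b = mean_y - m * mean_x                       # 13 - 1.5 * 10 = -2.0

def predict(x):
    return m * x + b

print(predict(14))   # predicted price of a 14-inch pizza: 19.0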
Linear: If the relationship between the independent and dependent variables is linear, then we
can use a straight line to fit the given data.
If the relationship between the independent and dependent variables is not linear, then linear
regression cannot be used as it will result in large errors.
Examples:
1. Height and Age
Child Growth: The relationship between age and height in children is typically non-linear, as
growth rates vary with age, often modeled using a quadratic function.
2. Plant Growth
Watering vs. Plant Height: The relationship between the amount of water given to a plant and
its growth can be non-linear, with optimal watering leading to significant height increases, while
too much or too little water can stunt growth.
Polynomial Regression:
Polynomial regression can handle non-linear relationships among variables by using an nth-degree polynomial. It can be used directly to deal with different levels of curvilinearity.
For example, the second-degree polynomial (called the quadratic transformation) is given as:
y = a₀ + a₁x + a₂x²
and the third-degree polynomial (called the cubic transformation) is given as:
y = a₀ + a₁x + a₂x² + a₃x³
Generally, polynomials of at most degree 4 are used, as higher-order polynomials take on strange shapes and make the curve too flexible. This leads to a situation of overfitting and hence is avoided.
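As a minimal sketch of polynomial regression (assuming NumPy is available; the data points below are made up for illustration), a quadratic can be fitted with numpy.polyfit:

import numpy as np

# Hypothetical non-linear data generated from y = 2 + 3x + 0.5x^2 plus noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 30)
y = 2 + 3 * x + 0.5 * x**2 + rng.normal(scale=2.0, size=x.shape)

# Fit the second-degree polynomial y = a0 + a1*x + a2*x^2.
coeffs = np.polyfit(x, y, deg=2)   # returned highest degree first: [a2, a1, a0]
quadratic = np.poly1d(coeffs)

print(quadratic(5.0))   # prediction at x = 5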
Overfitting: when a model is too complex and performs extremely well on the training data but poorly
on new, unseen data.
Underfitting: when a model is too simple and performs poorly on both the training data and new,
unseen data.
• Feature selection
Nonlinear regression can be used for feature selection in prediction modeling.
• Building interpretability
Nonlinear regression can help build interpretability into prediction models.
Holdout Method: In the holdout method, the available dataset is split into two disjoint
subsets: a training set and a test set. The model is trained on the training set and then
evaluated on the test set to estimate its generalization error. The test set should be
representative of the unseen data the model will encounter in real-world scenarios. The
generalization error is typically measured using metrics such as accuracy, mean squared
error, or area under the curve.
Cross-Validation: Cross-validation is a resampling technique that helps estimate the
generalization error by iteratively splitting the dataset into training and validation
subsets. One common approach is k-fold cross-validation, where the dataset is divided
into k equally sized folds. The model is trained k times, each time using k-1 folds as the
training set and the remaining fold as the validation set. The average performance
across the k iterations provides an estimate of the generalization error. This method
helps reduce the variance in the estimated error compared to the holdout method,
especially when the dataset is limited.
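Both evaluation schemes are easy to try with scikit-learn; the sketch below uses the bundled Iris dataset and a logistic regression classifier purely as placeholders:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

X, y = load_iris(return_X_y=True)

# Holdout method: one disjoint train/test split (70% / 30%).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("holdout accuracy:", model.score(X_test, y_test))

# k-fold cross-validation: average accuracy over k = 5 folds.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("5-fold CV accuracy:", scores.mean())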
K-Nearest Neighbor
o K-Nearest Neighbour is one of the simplest Machine Learning algorithms based on Supervised
Learning technique.
o The K-NN algorithm assumes similarity between the new case/data and the available cases, and puts
the new case into the category that is most similar to the available categories.
o The K-NN algorithm stores all the available data and classifies a new data point based on
similarity. This means that when new data appears, it can be easily classified into a well-suited
category by using the K-NN algorithm.
o K-NN algorithm can be used for Regression as well as for Classification but mostly it is used for
the Classification problems.
o K-NN is a non-parametric algorithm, which means it does not make any assumption on
underlying data.
o It is also called a lazy learner algorithm because it does not learn from the training set
immediately; instead, it stores the dataset and, at the time of classification, performs an
action on the dataset.
o At the training phase, the KNN algorithm just stores the dataset, and when it gets new data, it
classifies that data into the category that is most similar to the new data.
o Example: Suppose we have an image of a creature that looks similar to both a cat and a dog, and we
want to know whether it is a cat or a dog. For this identification, we can use the KNN algorithm,
as it works on a similarity measure. Our KNN model will find the features of the new image that
are similar to the cat and dog images, and based on the most similar features it will put the
image in either the cat or the dog category.
Suppose there are two categories, i.e., Category A and Category B, and we have a new data point x1.
In which of these categories will this data point lie? To solve this type of problem, we need a K-NN
algorithm. With the help of K-NN, we can easily identify the category or class of a particular data point.
Consider the below diagram:
The working of K-NN can be explained on the basis of the below algorithm:
o Step-1: Select the number K of neighbors.
o Step-2: Calculate the Euclidean distance from the new data point to each point in the training data.
o Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
o Step-4: Among these K neighbors, count the number of data points in each category.
o Step-5: Assign the new data point to the category for which the number of neighbors is
maximum.
Example: Predict the genre of the movie "Barbie", which has an IMDB rating of 7.4 and a duration of 114 minutes.
• Step-1: Calculate the distance from Barbie to every movie in the training data.
If k = 1, the lowest distance is 41, and that nearest movie is a comedy. This means Barbie is also predicted to be a comedy movie.
Advantages of KNN:
o It is simple to implement.
Disadvantages of KNN:
o It always needs to determine the value of K, which may be complex at times.
o The computation cost is high because the distance between the new data point and all the
training samples must be calculated.
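The Barbie example can be reproduced with scikit-learn's KNeighborsClassifier. Note that the training movies and their ratings/durations below are hypothetical stand-ins, since the original distance table is not reproduced in these notes:

from sklearn.neighbors import KNeighborsClassifier

# Hypothetical training data: [IMDB rating, duration in minutes] per movie.
X_train = [[7.2, 120], [6.8, 105], [8.0, 150], [5.9, 95], [7.9, 160]]
y_train = ["Comedy", "Comedy", "Action", "Comedy", "Action"]

# k = 1: classify by the single nearest neighbour (Euclidean distance).
# In practice the features should be scaled first, since duration
# dominates rating numerically.
knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

print(knn.predict([[7.4, 114]]))   # -> ['Comedy']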
Logistic Regression
Logistic regression is a statistical method used to predict the probability of a certain event
occurring. It’s especially useful when you’re dealing with binary outcomes (i.e., yes/no, 0/1,
true/false).
Key Concepts:
• Input: You provide some input data (called features). For example, these could be things
like age, income, or other factors.
• Output: The output is a probability value between 0 and 1, representing the likelihood of
a specific outcome. In binary logistic regression, this is typically either "success" (1) or
"failure" (0).
• Logistic Function (Sigmoid Curve): Instead of making predictions directly, logistic
regression uses a special mathematical function called the logistic (or sigmoid) function,
σ(z) = 1 / (1 + e^(-z)), to squish the results between 0 and 1.
Problem:
You want to predict whether a student will pass or fail a test based on the number of
hours they studied.
Data:
You have the following data for several students:
Data set:

Hours studied   Result (1 = pass, 0 = fail)
1               0
2               0
3               0
4               1
5               1
6               1
Graph:
Here is a graph showing how the probability of passing a test increases as the number of hours
studied goes up. The red dashed line represents the 0.5 threshold, which means if the probability
is greater than 0.5, the model would predict the student will pass. If it's below 0.5, the prediction
would be that the student will fail.
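Fitting this model to the six students above takes a few lines with scikit-learn (a minimal sketch):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hours studied vs. pass (1) / fail (0), from the table above.
hours = np.array([[1], [2], [3], [4], [5], [6]])
passed = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression().fit(hours, passed)

# Probability of passing after 3.5 hours, and the 0.5-threshold prediction.
print(model.predict_proba([[3.5]])[0, 1])
print(model.predict([[3.5]]))   # 1 = pass, 0 = fail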
Support Vector Machines
The objective of the support vector machine (SVM) algorithm is to maximize the margin
which is defined as the distance between the separating hyperplane (or decision
boundary) and the training samples that are closest to this hyperplane, the so-called
support vectors. The margin is calculated as the perpendicular distance from the line to
only the closest points, as shown in Figure 4-3. Hence, SVM calculates a
maximum-margin boundary that leads to a homogeneous partition of all data points.
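A minimal linear-SVM sketch with scikit-learn (the two blobs are synthetic stand-ins for linearly separable training data):

from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated blobs of points as toy training data.
X, y = make_blobs(n_samples=40, centers=2, random_state=6)

# A linear SVM; a large C keeps the margin hard for separable data.
svm = SVC(kernel="linear", C=1000).fit(X, y)

# The support vectors are the training samples closest to the hyperplane;
# they alone determine the maximum-margin boundary.
print(svm.support_vectors_)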
Unsupervised Learning
Unsupervised learning is a machine learning technique in which models are not supervised using a training
dataset. Instead, the model itself finds hidden patterns and insights in the given data. It can be
compared to the learning which takes place in the human brain while learning new things. It can be defined
as: unsupervised learning is a type of machine learning in which models are trained on an unlabeled
dataset and are allowed to act on that data without any supervision.
Unsupervised learning cannot be directly applied to a regression or classification problem because unlike
supervised learning, we have the input data but no corresponding output data. The goal of unsupervised
learning is to find the underlying structure of dataset, group that data according to similarities, and
represent that dataset in a compressed format.
Example: Suppose the unsupervised learning algorithm is given an input dataset containing images of
different types of cats and dogs. The algorithm is never trained upon the given dataset, which means it
does not have any idea about the features of the dataset. The task of the unsupervised learning
algorithm is to identify the image features on their own. Unsupervised learning algorithm will perform
this task by clustering the image dataset into the groups according to similarities between images.
Here, we have taken unlabeled input data, which means it is not categorized and corresponding
outputs are also not given. Now, this unlabeled input data is fed to the machine learning model in order
to train it. First, the model interprets the raw data to find the hidden patterns in the data, and then a
suitable algorithm such as k-means clustering or hierarchical clustering is applied.
Once it applies the suitable algorithm, the algorithm divides the data objects into groups according to
the similarities and differences between the objects.
The unsupervised learning algorithm can be further categorized into two types of problems:
o Clustering: Clustering algorithms group similar data points together based on their
characteristics or proximity in the feature space. K-means clustering, hierarchical clustering, and
DBSCAN are popular clustering algorithms. Clustering can be useful for customer segmentation,
image segmentation, document clustering, and more.
o Association: An association rule is an unsupervised learning method which is used for finding
relationships between variables in a large database. It determines the set of items that
occur together in the dataset. Association rules make marketing strategy more effective; for
example, people who buy item X (say, bread) also tend to purchase item Y (butter/jam). A
typical example of an association rule is Market Basket Analysis.
Below is a list of some popular unsupervised learning algorithms:
o K-means clustering
o Hierarchical clustering
o Anomaly detection
o Neural networks
o Apriori algorithm
Advantages of unsupervised learning:
o Unsupervised learning is used for more complex tasks as compared to supervised learning
because, in unsupervised learning, we don't have labeled input data.
Disadvantages of unsupervised learning:
o Unsupervised learning is intrinsically more difficult than supervised learning as it does not have
corresponding output labels.
o The result of an unsupervised learning algorithm might be less accurate, as the input data is not
labeled and the algorithm does not know the exact output in advance.
Unsupervised machine learning is a branch of machine learning where the algorithm learns from
unlabeled data to discover patterns, relationships, or structures within the dataset. Unlike supervised
learning, which requires labeled examples for training, unsupervised learning operates on raw,
unclassified data.
The primary objective of unsupervised learning is to gain insights into the underlying structure of the
data and extract meaningful information without prior knowledge or guidance. It can be used for tasks
such as data exploration, clustering and dimensionality reduction.
Unsupervised learning techniques are valuable when dealing with large, unlabeled datasets, as they can
uncover hidden patterns, extract useful representations, and provide a foundation for subsequent
analysis or decision-making. However, evaluation and interpretation of unsupervised models can be
subjective and challenging due to the absence of explicit ground truth labels.
K-Means Clustering
K-Means Clustering is an Unsupervised Learning algorithm, which groups the unlabeled dataset into
different clusters. Here K defines the number of pre-defined clusters that need to be created in the
process, as if K=2, there will be two clusters, and for K=3, there will be three clusters, and so on.
It is an iterative algorithm that divides the unlabeled dataset into k different clusters in such a way that
each data point belongs to only one group, whose members share similar properties.
It allows us to cluster the data into different groups, and it is a convenient way to discover the categories of
groups in an unlabeled dataset on its own, without the need for any training.
It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim of this
algorithm is to minimize the sum of distances between the data points and their corresponding cluster centroids.
The algorithm takes the unlabeled dataset as input, divides the dataset into k clusters, and
repeats the process until it finds the best clusters. The value of k should be predetermined in
this algorithm.
The k-means algorithm mainly performs two tasks:
o Determines the best value for the K center points or centroids by an iterative process.
o Assigns each data point to its closest k-center. The data points which are near a particular
k-center create a cluster.
Hence each cluster has data points with some commonalities and is far away from the other clusters.
The below diagram explains the working of the K-means Clustering Algorithm:
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points or centroids. (They can be points other than those in the input dataset.)
Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters.
Step-4: Calculate the variance and place a new centroid for each cluster.
Step-5: Repeat the third step: reassign each data point to the new closest centroid of its cluster.
Step-6: If any reassignment occurred, go back to Step-4; otherwise, the model is ready.
Suppose we have two variables M1 and M2. The x-y axis scatter plot of these two variables is given
below:
o Let's take the number of clusters to be K = 2, i.e., we will try to group these data points into two
different clusters.
o We need to choose some random K points or centroids to form the clusters. These points can be
either points from the dataset or any other points. So, here we are selecting the below two
points as K points, which are not part of our dataset. Consider the below image:
o Now we will assign each data point of the scatter plot to its closest K-point or centroid. We will
compute this by applying the mathematics we have studied for calculating the distance between
two points. Then, we will draw a median line between both centroids. Consider the below
image:
From the above image, it is clear that points on the left side of the line are near the K1 or blue centroid, and
points on the right of the line are close to the yellow centroid. Let's color them blue and yellow for
clear visualization.
o As we need to find the closest cluster, we will repeat the process by choosing new centroids.
To choose the new centroids, we will compute the center of gravity of each cluster, and will
find the new centroids as below:
o Next, we will reassign each data point to the new centroids. For this, we will repeat the same
process of finding a median line. The median line will be as in the below image:
From the above image, we can see that one yellow point is on the left side of the line, and two blue points
are to the right of the line. So, these three points will be assigned to new centroids.
As reassignment has taken place, we will again go to Step-4, which is finding new centroids or
K-points.
o We will repeat the process by finding the center of gravity of the clusters, so the new centroids will
be as shown in the below image:
o As we have got the new centroids, we will again draw the median line and reassign the data points. So,
the image will be:
o We can see in the above image that there are no dissimilar data points on either side of the line,
which means our model is formed. Consider the below image:
As our model is ready, we can now remove the assumed centroids, and the two final clusters will be
as shown in the below image:
Numerical example of K-means clustering
Consider seven customers, C1-C7, each described by two features:

Customer   Feature 1   Feature 2
C1         20          500
C2         40          1000
C3         30          800
C4         18          300
C5         28          1200
C6         35          1400
C7         45          1800
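Clustering these seven customers with scikit-learn's KMeans (a sketch; K = 2 is chosen arbitrarily here):

import numpy as np
from sklearn.cluster import KMeans

# The seven customers from the table above (feature 1, feature 2).
X = np.array([[20, 500], [40, 1000], [30, 800], [18, 300],
              [28, 1200], [35, 1400], [45, 1800]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster assigned to C1..C7
print(kmeans.cluster_centers_)  # final centroids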
The performance of the K-means clustering algorithm depends on how efficient the clusters that it
forms are. But choosing the optimal number of clusters is a big task. There are different ways to find
the optimal number of clusters, but here we discuss the most appropriate method for finding the
number of clusters, i.e., the value of K. The method is given below:
Elbow Method
The Elbow method is one of the most popular ways to find the optimal number of clusters. This method
uses the concept of WCSS value. WCSS stands for Within Cluster Sum of Squares, which defines the total
variations within a cluster. The formula to calculate the value of WCSS (for 3 clusters) is given below:
WCSS = Σ(Pi in Cluster1) distance(Pi, C1)² + Σ(Pi in Cluster2) distance(Pi, C2)² + Σ(Pi in Cluster3) distance(Pi, C3)²
Here, Σ(Pi in Cluster1) distance(Pi, C1)² is the sum of the squared distances between each data point and its
centroid within cluster 1, and likewise for the other two terms.
To measure the distance between data points and centroid, we can use any method such as Euclidean
distance or Manhattan distance.
To find the optimal value of clusters, the elbow method follows the below steps:
o It executes the K-means clustering on a given dataset for different K values (ranges from 1-10).
o Plots a curve between calculated WCSS values and the number of clusters K.
o The sharp point of bend in the plot (where the curve looks like the elbow of an arm) is considered
the best value of K.
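The elbow curve for the customer data above can be computed with scikit-learn, whose inertia_ attribute is exactly the WCSS (a sketch; with only 7 points, K ranges over 1-6 rather than 1-10):

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[20, 500], [40, 1000], [30, 800], [18, 300],
              [28, 1200], [35, 1400], [45, 1800]])

# WCSS for each K; plot K against these values and look for the elbow.
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, km.inertia_)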
Hierarchical Clustering
Hierarchical clustering is a connectivity-based clustering model that groups the data points
together that are close to each other based on the measure of similarity or distance. The
assumption is that data points that are close to each other are more similar or related than data
points that are farther apart.
Types of Hierarchical Clustering
Basically, there are two types of hierarchical Clustering:
1. Agglomerative Clustering (Dendrogram Clustering)
2. Divisive clustering
Hierarchical Agglomerative Clustering
The agglomerative hierarchical clustering algorithm is a popular example of HCA (Hierarchical
Cluster Analysis). To group the datasets into clusters, it follows the bottom-up approach. This
means the algorithm considers each data point as a single cluster at the beginning and then starts
combining the closest pairs of clusters. It does this until all the clusters are merged into
a single cluster that contains all the data points.
Divisive Clustering:
• Divisive clustering takes the opposite, top-down approach compared to agglomerative clustering:
it starts with all data points in a single cluster and recursively splits it. It is used in the
following ways:
• Alternative to k-means: Divisive clustering can be used as an alternative to k-means
clustering.
• Faster than agglomerative clustering: Divisive clustering can be faster than agglomerative
clustering because it only takes O(N) time if the number of levels is constant.
• Splits based on all the data: Divisive clustering makes splitting decisions based on all the
data, while bottom-up methods make myopic merge decisions.
Advantages of Hierarchical clustering
o It is simple to implement and gives the best output in some cases.
o It is easy and results in a hierarchy, a structure that contains more information.
o It does not need us to pre-specify the number of clusters.
Disadvantages of hierarchical clustering
o It breaks large clusters.
o It is difficult to handle clusters of different sizes and convex shapes.
o It is sensitive to noise and outliers.
o Once a merge has been performed, it can never be undone or changed.
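A minimal agglomerative clustering sketch with scikit-learn (the six toy points are made up; Ward linkage merges the pair of clusters that least increases within-cluster variance):

import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Six toy points forming two obvious groups.
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

# Bottom-up: every point starts as its own cluster; the closest
# pairs of clusters are merged until only two clusters remain.
agg = AgglomerativeClustering(n_clusters=2, linkage="ward").fit(X)
print(agg.labels_)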
Density-Based Clustering
Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is the base algorithm for
density-based clustering. It can discover clusters of different shapes and sizes from a large
amount of data that contains noise and outliers.
The DBSCAN algorithm uses two parameters:
• minPts: the minimum number of points required inside the circle for it to count as a dense region (e.g., 3).
• eps (Epsilon): the radius of the circle formed with a data object as its centre.
There are three types of points after the DBSCAN clustering is complete:
• Core point: a point that has at least minPts points (including itself) within distance eps.
• Border point: a point that has fewer than minPts points within eps but lies in the neighbourhood of a core point.
• Noise point (outlier): a point that is neither a core point nor a border point.
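A minimal DBSCAN sketch with scikit-learn (the points are synthetic; eps and min_samples correspond to the eps and minPts parameters above):

import numpy as np
from sklearn.cluster import DBSCAN

# Two dense groups plus one isolated point that should be flagged as noise.
X = np.array([[1.0, 1.0], [1.2, 1.1], [0.9, 1.0],
              [8.0, 8.0], [8.1, 8.2], [7.9, 8.0],
              [50.0, 50.0]])

db = DBSCAN(eps=1.0, min_samples=3).fit(X)
print(db.labels_)   # e.g. [0 0 0 1 1 1 -1]; -1 marks noise points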
Recommendation Systems
Content-Based Filtering:
This technique recommends items whose attributes are similar to those of items the user has liked in the past.
Example: User1 watched movies M1 and M2, giving M1 a rating of 5 and M2 a rating of 4. Both are
romantic movies, or they are by the same director. Then movie M3 was released, which User1
hasn't seen yet. Based on their interests, we will recommend this movie to User1 because it is also
a romantic movie and is also directed by the same director.
Collaborative Filtering:
This technique is frequently used in recommender systems to identify similarities between
users and items.
If user A and user B both like product A, and user B also likes product B, then product B could be
recommended to user A by the system.
The model keeps track of which products users like and of those products' characteristics.
Based on the common interests of user3 and user4, a product was recommended to user3.
According to the interests of user3 and user2, a product was recommended to user2.
Another Example:
User1 watched movie M1 and gave it 5 stars. User2 also watched movie M1 and gave it 5 stars.
Then, User1 watched movie M2 and gave it 4 stars, while User2 also watched movie M2 and gave
it 5 stars. User1 then watched movie M3 and gave it 4 stars, but User2 hasn't seen movie M3. Based
on their common interests up to movie M2, we will recommend movie M3 to User2, because
there is a high possibility that User2 will also like movie M3.
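The user-based reasoning in this example can be sketched with a cosine similarity over the ratings both users share (plain NumPy; the 0.9 similarity threshold is an arbitrary illustration):

import numpy as np

# Ratings for M1, M2, M3 from the example above; np.nan marks "not seen yet".
user1 = np.array([5.0, 4.0, 4.0])
user2 = np.array([5.0, 5.0, np.nan])

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Compare the users only on the movies both have rated (M1 and M2).
sim = cosine(user1[:2], user2[:2])
print("similarity(user1, user2) =", round(sim, 3))   # ~0.994

# Highly similar users -> recommend User1's liked-but-unseen movie to User2.
if sim > 0.9 and np.isnan(user2[2]) and user1[2] >= 4:
    print("recommend M3 to user2")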