Supervised Learning Models – Regression – Assumptions
In supervised learning models, specifically regression models, there
are several assumptions that, if met, ensure the validity of the results.
1. Linearity
• The relationship between the independent variables (features) and the dependent
variable (target) is linear.
• Example: If you are predicting house prices based on square footage, the
relationship between the square footage and house prices should follow a straight
line or a linear trend.
2. Independence of Errors (No Autocorrelation)
• The residuals (errors) are independent of each other. In time-series data, this
means that there should be no correlation between the error terms at different
time points.
This assumption is especially important in time-series data, where errors at one time point may be related to errors at another. If the errors are correlated, the model is likely missing some important time-dependent structure, leading to inefficient coefficient estimates and biased standard errors.
A common test for checking autocorrelation is the Durbin-Watson test, which
measures the correlation between residuals. A Durbin-Watson statistic near 2
indicates no autocorrelation, while values closer to 0 or 4 suggest positive or
negative autocorrelation, respectively.
• Example: Predicting stock prices based on historical data—if the errors are
correlated, future stock prices might depend on previous errors, violating this
assumption.
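A minimal sketch of the Durbin-Watson check described above, using statsmodels; the synthetic data and column setup are only illustrative, not part of the original material.

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

# Illustrative data: a simple linear signal with noise
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100)
y = 2.5 * x + rng.normal(scale=1.0, size=x.size)

# Fit OLS and compute the Durbin-Watson statistic on the residuals
X = sm.add_constant(x)
results = sm.OLS(y, X).fit()
dw = durbin_watson(results.resid)
print(f"Durbin-Watson statistic: {dw:.2f}")  # a value near 2 suggests no autocorrelation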
3. Homoscedasticity (Constant Variance of Errors)
• The variance of the residuals (errors) should remain constant across all
levels of the independent variable(s). This means the spread of residuals
should be consistent.
• Example: Predicting employee salaries based on experience—if the
variance of the residuals increases as experience increases, the model
violates this assumption.
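One standard diagnostic for this assumption (not prescribed by the text, but commonly used) is the Breusch-Pagan test; a rough sketch with deliberately heteroscedastic toy data:

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

# Illustrative data: noise variance grows with x (heteroscedastic by construction)
rng = np.random.default_rng(0)
x = np.linspace(1, 10, 200)
y = 3.0 * x + rng.normal(scale=0.5 * x, size=x.size)

X = sm.add_constant(x)
results = sm.OLS(y, X).fit()

# Breusch-Pagan test: a small p-value suggests heteroscedasticity
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(results.resid, X)
print(f"Breusch-Pagan p-value: {lm_pvalue:.4f}")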
4. Normality of Errors
• The residuals should be normally distributed.
• Example: In predicting the weight of an object based on its volume, the
errors should follow a normal distribution for the regression coefficients
to be valid.
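A quick sketch of checking residual normality with a Shapiro-Wilk test (a standard choice, assumed here rather than taken from the text); a Q-Q plot is a common visual alternative.

import numpy as np
import statsmodels.api as sm
from scipy import stats

# Illustrative data: linear trend with Gaussian noise
rng = np.random.default_rng(0)
x = np.linspace(0, 5, 150)
y = 1.2 * x + rng.normal(scale=0.3, size=x.size)

X = sm.add_constant(x)
resid = sm.OLS(y, X).fit().resid

# Shapiro-Wilk test: a large p-value is consistent with normally distributed residuals
stat, p_value = stats.shapiro(resid)
print(f"Shapiro-Wilk p-value: {p_value:.4f}")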
5. No Multicollinearity
• There should be little to no multicollinearity between the independent
variables. Multicollinearity occurs when independent variables are highly
correlated, making it difficult to estimate the effect of each variable.
• Example: In predicting house prices, if both square footage and number
of rooms are highly correlated, multicollinearity may exist.
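A common way to quantify multicollinearity is the variance inflation factor (VIF); a sketch with made-up house-price features, where "rooms" is constructed to correlate strongly with "sqft":

import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Illustrative features: 'rooms' is strongly correlated with 'sqft'
rng = np.random.default_rng(0)
sqft = rng.normal(1500, 300, size=500)
rooms = sqft / 400 + rng.normal(scale=0.2, size=500)
age = rng.uniform(0, 50, size=500)
X = pd.DataFrame({"sqft": sqft, "rooms": rooms, "age": age})

# VIF above roughly 5-10 is a common rule of thumb for problematic multicollinearity
for i, col in enumerate(X.columns):
    print(f"{col}: VIF = {variance_inflation_factor(X.values, i):.2f}")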
Summary of Key Assumptions in Linear Regression:
1. Linearity: The relationship between the independent and dependent
variable should be linear.
2. Independence: The errors should be independent of each other.
3. Homoscedasticity: The variance of the errors should be constant
across all levels of the independent variables.
4. Normality: The errors should be normally distributed.
5. No Multicollinearity: The independent variables should not be highly
correlated.
Supervised Learning Models – Regression – Model Building
In supervised learning, regression is a technique used to model the
relationship between a dependent variable (response or target) and
one or more independent variables (predictors or features). The goal
of regression is to predict the value of the target variable based on the
input features.
Types of Regression Models
1. Linear Regression: Models a linear relationship between the
dependent and independent variables.
2. Polynomial Regression: Extends linear regression by adding
polynomial terms to capture non-linear relationships.
3. Ridge and Lasso Regression: These are regularization techniques
that prevent overfitting by adding penalties on the size of coefficients.
Steps for Building a Regression Model
1. Import Required Libraries
2. Data Preparation: Preprocess the data to handle missing values,
outliers, and categorical variables.
3. Splitting the Data: Split the data into training and testing sets.
4. Model Building: Train the regression model using the training data.
5. Model Evaluation: Evaluate the model on test data using performance
metrics like Mean Squared Error (MSE), R-squared, etc.
6. Visualization: Visualize the results (e.g., the regression line, residuals).
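A minimal end-to-end sketch of these steps with scikit-learn; the synthetic one-feature dataset is only for illustration.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# 1-2. Prepare (synthetic) data: one feature, linear trend plus noise
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X[:, 0] + 5.0 + rng.normal(scale=2.0, size=200)

# 3. Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 4. Train the regression model
model = LinearRegression().fit(X_train, y_train)

# 5. Evaluate on the test set
y_pred = model.predict(X_test)
print("MSE:", mean_squared_error(y_test, y_pred))
print("R-squared:", r2_score(y_test, y_pred))

# 6. Visualize the fitted line against the test data
order = X_test[:, 0].argsort()
plt.scatter(X_test[:, 0], y_test, label="actual")
plt.plot(X_test[order, 0], y_pred[order], color="red", label="predicted")
plt.legend()
plt.show()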
Supervised Learning Models – Regression – Inference
Inference in regression analysis involves making predictions about the
dependent variable (outcome) based on the independent variables
(predictors) and estimating the relationship between them. This process often
includes examining the model's coefficients, confidence intervals, p-
values, and the overall goodness of fit.
Key Concepts in Regression Inference
1. Model Coefficients: These represent the estimated change in the
response variable for a one-unit change in the predictor variable,
holding all other variables constant.
2. Hypothesis Testing: Involves testing if the coefficients are significantly
different from zero using p-values. A low p-value (< 0.05) indicates
that you can reject the null hypothesis that the coefficient is equal to
zero.
3. Confidence Intervals: These provide a range of values for the
coefficients that likely contain the true parameter values.
4. Goodness of Fit: Metrics like R-squared and Adjusted R-squared measure how well the model explains the variability in the response variable.
The following walkthrough illustrates these ideas on the California housing dataset:
1. Data Loading and Preparation:
1. The California housing dataset is loaded, and features and target variables are
extracted.
2. The dataset is split into training and testing sets. A constant is added to the training
set for OLS regression.
2. Linear Regression:
1. We fit a linear regression model using OLS and display the results, including
coefficients, p-values, R-squared, and adjusted R-squared values.
2. We visualize the predicted vs. actual values.
3. Ridge Regression:
1. We fit a Ridge regression model and display the coefficients. Ridge regression
applies L2 regularization, which helps to reduce overfitting by penalizing large
coefficients.
2. The predicted vs. actual values are visualized.
4. Lasso Regression:
1. A Lasso regression model is fit, and the coefficients are displayed. Lasso regression
applies L1 regularization, which can shrink some coefficients to zero, effectively
performing feature selection.
2. The predicted vs. actual values are visualized.
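A condensed sketch of this walkthrough; the train/test split ratio and the alpha values for Ridge and Lasso are assumptions chosen for illustration, not taken from the text.

import statsmodels.api as sm
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# 1. Load the California housing data and split into training and testing sets
data = fetch_california_housing()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# 2. Linear regression with OLS: coefficients, p-values, R-squared in the summary
X_train_const = sm.add_constant(X_train)
ols_results = sm.OLS(y_train, X_train_const).fit()
print(ols_results.summary())

# 3. Ridge regression (L2 regularization); alpha chosen arbitrarily here
ridge = Ridge(alpha=1.0).fit(X_train, y_train)
print("Ridge coefficients:", ridge.coef_)
print("Ridge test R^2:", r2_score(y_test, ridge.predict(X_test)))

# 4. Lasso regression (L1 regularization); may shrink some coefficients to zero
lasso = Lasso(alpha=0.01).fit(X_train, y_train)
print("Lasso coefficients:", lasso.coef_)
print("Lasso test R^2:", r2_score(y_test, lasso.predict(X_test)))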
Inference Results
• Coefficients: Each model will provide coefficients that represent the
estimated effect of each feature on the target variable. For Lasso, some
coefficients may be zero, indicating that those features are not
influential in predicting the target.
• P-values (in OLS): The p-values will indicate whether the coefficients
are statistically significant. Generally, a p-value < 0.05 suggests that the
predictor is significant.
• Goodness of Fit: The R-squared and adjusted R-squared values from
the OLS summary indicate how well the model explains the variability of
the target variable.
Understanding the Output
1. Coefficients: The output will show the estimated coefficients for each
feature. For instance, if the coefficient for a feature is positive, it
suggests a positive relationship with the target variable.
2. P-values: If a p-value is less than 0.05, it suggests that the
corresponding predictor is statistically significant.
3. R-squared Value: Indicates how well the model explains the variability
of the response variable. An R-squared value close to 1 suggests a good
fit.
4. Confidence Intervals: Typically provided in the summary output, these
intervals give a range in which we expect the true coefficient values to
lie.
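Assuming the fitted statsmodels results object from the walkthrough above (here called ols_results), these quantities can be read off directly:

# Assumes `ols_results` is a fitted statsmodels OLS results object (see the sketch above)
print(ols_results.params)       # estimated coefficients
print(ols_results.pvalues)      # p-values for each coefficient
print(ols_results.conf_int())   # 95% confidence intervals by default
print(ols_results.rsquared, ols_results.rsquared_adj)  # goodness of fit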
Supervised Learning Models – Classification – Algorithms
Supervised Learning models, especially classification algorithms, are
used when the target variable is discrete, meaning it falls into distinct
categories (like "spam" or "not spam"). These algorithms learn a
mapping from input features to a target label using a set of labeled
training data. The goal is to generalize this learned mapping to classify
unseen data correctly.
Key Concepts in Classification:
1. Training: Learning a model by providing it with labeled examples.
2. Testing: Evaluating the model on unseen data to assess its
generalization ability.
3. Evaluation Metrics: Accuracy, precision, recall, F1-score, confusion
matrix, etc., are used to measure model performance.
Classification Algorithms:
1. Logistic Regression: A linear model for binary classification (can be
extended to multiclass problems).
2. K-Nearest Neighbors (KNN): A non-parametric model that classifies
based on the majority class of its nearest neighbors.
3. Support Vector Machines (SVM): A classifier that tries to find the
optimal hyperplane separating classes.
4. Decision Trees: A model that partitions data by making decisions at
each node.
5. Random Forest: An ensemble of decision trees that reduces overfitting
and improves accuracy.
6. Naive Bayes: A probabilistic model based on Bayes' Theorem,
assuming feature independence.
1. Logistic Regression
Logistic Regression is a linear classifier used for binary or multiclass
classification. It predicts the probability that a data point belongs to a
certain class using the logistic function (sigmoid), which outputs values
between 0 and 1.
Formula: p = 1 / (1 + e^(-w·x))
Where p is the probability, w is the weight vector, and x is the input vector (a bias term is typically included in w).
• Useful for binary classification problems like spam detection, disease
diagnosis, and credit scoring.
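A small scikit-learn sketch of binary classification with logistic regression; the built-in breast-cancer dataset is used here only as a convenient example.

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load a built-in binary classification dataset
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Fit a logistic regression classifier; the sigmoid gives class probabilities in [0, 1]
clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)
print("Predicted probabilities (first 3 test points):", clf.predict_proba(X_test[:3]))
print("Test accuracy:", accuracy_score(y_test, clf.predict(X_test)))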
2. K-Nearest Neighbors (KNN)
KNN is a simple, non-parametric, instance-based learning algorithm. It
classifies a new data point based on the majority class of its nearest k
neighbors. It doesn't build an explicit model during training; instead, it
stores the training data and computes distances during prediction.
Distance metric: Typically, Euclidean distance is used.
Choosing k: The optimal number of neighbors can be found through
cross-validation.
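A brief sketch of choosing k by cross-validation, as suggested above; the iris dataset and the candidate values of k are assumptions for illustration.

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Try several values of k and keep the one with the best cross-validated accuracy
scores = {}
for k in [1, 3, 5, 7, 9, 11]:
    knn = KNeighborsClassifier(n_neighbors=k)  # Euclidean distance by default
    scores[k] = cross_val_score(knn, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print("Cross-validated accuracy per k:", scores)
print("Best k:", best_k)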
Accuracy
Example: Suppose we have 100 examples in the positive class and 200 examples in the negative class. Our model correctly declares 80 of the 100 positives as positive and 195 of the 200 negatives as negative.
So, accuracy = (80 + 195) / (100 + 200) = 91.7%
Confusion Matrix
Shows a more detailed breakdown of correct and incorrect classifications for
each class.
Think about our previous example; its confusion matrix looks like this:

                        Predicted positive    Predicted negative
Labeled as positive            80                    20
Labeled as negative             5                   195

What is the accuracy of the positive class? And of the negative class?
Positive class: 80/100 = 80%. Negative class: 195/200 = 97.5%.
Clearly, the positive class has lower accuracy than the negative class,
and that information is lost if we calculate overall accuracy only.
F1-score is the harmonic mean of precision and recall. Compared to the arithmetic mean, both precision and recall need to be high for the harmonic mean to be high.
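A small sketch reproducing these numbers with scikit-learn, using synthetic labels constructed to match the 80/20 and 195/5 breakdown above.

import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Synthetic labels matching the example: 100 positives (80 found), 200 negatives (195 found)
y_true = np.array([1] * 100 + [0] * 200)
y_pred = np.array([1] * 80 + [0] * 20 + [0] * 195 + [1] * 5)

print(confusion_matrix(y_true, y_pred))            # rows: true class, columns: predicted class
print("Recall (positive class):", recall_score(y_true, y_pred))        # 80/100 = 0.80
print("Precision (positive class):", precision_score(y_true, y_pred))  # 80/85 ~ 0.94
print("F1 (harmonic mean of the two):", f1_score(y_true, y_pred))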
Regression Metrics
What Regression Tasks do?
Model learns to predict numeric scores.
For example, we try to predict the price of a stock on future days given past price history
and other useful information
Some Regression Metrics..
RMSE (Root Mean Square Error)
Quantiles of Errors
RMSE
The most commonly used metric for regression tasks
Also known as RMSD ( root-mean-square deviation)
This is defined as the square root of the average squared distance between the actual score and the predicted score:
RMSE = sqrt( (1/n) * Σ (y_i - ŷ_i)^2 )
Quantiles of Errors
RMSE is an average, so it is sensitive to large outliers.
If the regressor performs really badly on even a single data point, the average error can become large, so RMSE is not robust.
Quantiles of the absolute error, such as the median absolute error, are more robust because they are not dominated by a few extreme points.
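A quick sketch contrasting RMSE with error quantiles on toy predictions containing one large outlier; the numbers are made up for illustration.

import numpy as np

# Toy actual vs. predicted scores; the last prediction is badly off
y_true = np.array([10.0, 12.0, 9.5, 11.0, 10.5, 30.0])
y_pred = np.array([10.2, 11.8, 9.7, 10.9, 10.4, 10.0])

errors = np.abs(y_true - y_pred)
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

print("RMSE:", rmse)                                   # inflated by the single outlier
print("Median absolute error:", np.median(errors))     # robust to the outlier
print("90th percentile of |error|:", np.quantile(errors, 0.9))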
Clustering in Machine Learning
Clustering, or cluster analysis, is a machine learning technique that groups an unlabelled dataset. It can be defined as "a way of grouping the data points into different clusters consisting of similar data points; the objects with possible similarities remain in a group that has little or no similarity with another group."
It does this by finding similar patterns in the unlabelled dataset, such as shape, size, color, or behavior, and divides the data points according to the presence or absence of those patterns.
After applying this clustering technique, each cluster or group is given a cluster ID, which an ML system can use to simplify the processing of large and complex datasets.
The clustering technique is commonly used for statistical data analysis.
Example: Let's understand the clustering technique with the real-world example of a shopping mall: items with similar usage are grouped together in the same section (for example, t-shirts in one section and trousers in another), so customers can easily find what they need.
The clustering technique can be widely used in various tasks. Some most common uses of
this technique are:
• Market Segmentation
• Statistical data analysis
• Social network analysis
• Image segmentation
• Anomaly detection, etc.
Apart from these general uses, clustering is used by Amazon in its recommendation system to provide recommendations based on past product searches. Netflix also uses this technique to recommend movies and web series to its users based on their watch history.
Types of Clustering Methods
Clustering methods are broadly divided into hard clustering (each data point belongs to only one group) and soft clustering (data points can belong to more than one group). Various other approaches to clustering also exist. Below are the main clustering methods used in machine learning:
1. Partitioning Clustering
2. Density-Based Clustering
3. Distribution Model-Based Clustering
4. Hierarchical Clustering
5. Fuzzy Clustering
Partitioning Clustering
It is a type of clustering that divides the data into non-hierarchical groups. It is also known as
the centroid-based method. The most common example of partitioning clustering is the
K-Means Clustering algorithm.
In this type, the dataset is divided into a set of k groups, where k defines the number of pre-defined groups. The cluster centers are chosen so that each data point is closer to its own cluster's centroid than to any other cluster's centroid.
Density-Based Clustering
The density-based clustering method connects highly dense areas into clusters, and arbitrarily shaped clusters are formed as long as the dense regions can be connected. The algorithm identifies the different clusters in the dataset by connecting areas of high density, which are separated from each other by sparser areas.
These algorithms can face difficulty in clustering the data points if the dataset has varying
densities and high dimensions.
Distribution Model-Based Clustering
In the distribution model-based clustering method, the data is divided based on the probability that each data point belongs to a particular distribution. The grouping is done by assuming the data comes from known distributions, most commonly the Gaussian distribution.
The example of this type is the Expectation-Maximization Clustering algorithm that
uses Gaussian Mixture Models (GMM)
Hierarchical Clustering
Hierarchical clustering can be used as an alternative to partitioning clustering, as there is no requirement to pre-specify the number of clusters to be created. In this technique, the dataset is divided into clusters that form a tree-like structure, also called a dendrogram. Any number of clusters can then be selected by cutting the tree at the appropriate level. The most common example of this method is the Agglomerative Hierarchical algorithm.
Fuzzy Clustering
Fuzzy clustering is a type of soft clustering in which a data object may belong to more than one group or cluster. Each data point has a set of membership coefficients that reflect its degree of membership in each cluster. The Fuzzy C-means algorithm is the main example of this type of clustering; it is sometimes also known as the Fuzzy k-means algorithm.
Clustering Algorithms
Clustering algorithms can be divided according to the models explained above. The choice of algorithm also depends on the kind of data we are using: some algorithms require the number of clusters to be specified in advance, whereas others work from a minimum distance between observations in the dataset.
1. K-Means algorithm: The k-means algorithm is one of the most popular clustering algorithms. It partitions the samples into clusters of roughly equal variance. The number of clusters must be specified in advance. It is fast and requires relatively few computations, with linear complexity O(n).
2. DBSCAN Algorithm: DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. It is an example of a density-based model, similar to mean-shift but with some notable advantages. In this algorithm, areas of high density are separated by areas of low density, so clusters can be found in any arbitrary shape.
3. Expectation-Maximization Clustering using GMM: This algorithm can be used as an alternative to the k-means algorithm, or for cases where k-means fails. In GMM, the data points are assumed to be Gaussian distributed.
4. Agglomerative Hierarchical algorithm: The agglomerative hierarchical algorithm performs bottom-up hierarchical clustering. Each data point is treated as a single cluster at the outset, and clusters are then successively merged. The cluster hierarchy can be represented as a tree structure.
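A minimal scikit-learn sketch of bottom-up agglomerative clustering; the blob data and the choice of three clusters are assumptions for illustration.

from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering

# Illustrative data: three well-separated blobs
X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

# Bottom-up merging of clusters until three remain
agg = AgglomerativeClustering(n_clusters=3)
labels = agg.fit_predict(X)
print("First ten cluster labels:", labels[:10])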
Applications of Clustering
Below are some commonly known applications of clustering technique in Machine Learning:
• In Identification of Cancer Cells: Clustering algorithms are widely used for the identification of cancerous cells. They separate cancerous and non-cancerous cells into different groups.
• In Search Engines: Search engines also work on the clustering technique. Search results are returned based on the objects closest to the search query, by grouping similar data objects into one group that is far from the dissimilar objects. The accuracy of a query's results depends on the quality of the clustering algorithm used.
• Customer Segmentation: Clustering is used in market research to segment customers based on their choices and preferences.
• In Biology: It is used in biology to classify different species of plants and animals using image recognition techniques.
• In Land Use: The clustering technique is used to identify areas of similar land use in a GIS database. This is very useful for finding the purpose for which a particular piece of land is most suitable.
Unsupervised Learning Models – Clustering – Algorithms
1. K-Means Clustering
2. Hierarchical Clustering
3. DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
4. Gaussian Mixture Models (GMM)
5. Agglomerative Clustering
1. K-Means Clustering
K-Means is one of the most commonly used clustering algorithms. It partitions data into k
clusters, where each data point belongs to the cluster with the nearest centroid.
Steps:
1. Choose k centroids randomly.
2. Assign each data point to the nearest centroid.
3. Update the centroids by calculating the mean of all data points in each cluster.
4. Repeat until centroids no longer move or a set number of iterations is reached.
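A minimal sketch of these steps with scikit-learn (KMeans repeats the assign/update loop internally); the blob data and k = 3 are assumptions for illustration.

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Illustrative data: three blobs in 2-D
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# KMeans repeats the assignment and centroid-update steps until convergence
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print("Cluster centroids:\n", kmeans.cluster_centers_)
print("First ten labels:", labels[:10])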
2. Hierarchical Clustering
Hierarchical clustering builds a nested tree of clusters (a dendrogram) by successively merging or splitting clusters, so the number of clusters does not need to be specified in advance; it can be chosen by cutting the tree at a suitable level.
3. DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
DBSCAN is a density-based clustering algorithm that forms clusters based on regions of high
density. It can find clusters of arbitrary shapes and handle noise (outliers).
Parameters:
• eps: Maximum distance between two points to be considered neighbors.
• min_samples: Minimum number of points required to form a dense region.
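A short sketch with scikit-learn's DBSCAN; the half-moon data and the eps and min_samples values are illustrative assumptions.

from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

# Two interleaving half-moons: non-spherical clusters that K-Means handles poorly
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# eps: neighbourhood radius; min_samples: points needed to form a dense region
db = DBSCAN(eps=0.2, min_samples=5)
labels = db.fit_predict(X)

# Label -1 marks noise points (outliers)
print("Clusters found:", set(labels))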
4. Gaussian Mixture Models (GMM)
GMM assumes that the data is generated from a mixture of several Gaussian distributions.
Unlike K-Means, GMM can model clusters of different shapes and sizes because it considers
both the mean and the covariance of the data.
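A brief sketch with scikit-learn's GaussianMixture; the blob data and the number of components are assumptions for illustration.

from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Illustrative data: three Gaussian-looking blobs
X, _ = make_blobs(n_samples=300, centers=3, random_state=7)

# Fit a mixture of three Gaussians via Expectation-Maximization
gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=7)
gmm.fit(X)

print("Component means:\n", gmm.means_)
print("Soft assignment of the first point:", gmm.predict_proba(X[:1]))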
Summary of Clustering Algorithms:

Algorithm    Strengths                                  Weaknesses
K-Means      Simple, fast, efficient for                Sensitive to initial centroids,
             spherical clusters                         cannot handle non-spherical clusters