
Presenter’s Name

Dr. Shital Bhatt


Associate Professor
School of Computational and Data Sciences

www.vidyashilpuniversity.com
Supervised Learning Models –
Regression – Assumptions
 In supervised learning models, specifically regression models, there
are several assumptions that, if met, ensure the validity of the results.
 1. Linearity
• The relationship between the independent variables (features) and the dependent
variable (target) is linear.
• Example: If you are predicting house prices based on square footage, the
relationship between the square footage and house prices should follow a straight
line or a linear trend.
 2. Independence of Errors (No Autocorrelation)
• The residuals (errors) are independent of each other. In time-series data, this
means that there should be no correlation between the error terms at different
time points.
 In regression models, the independence-of-errors assumption means that the
residuals (errors) should be independent of each other. This is crucial, especially in
time-series data, where errors at one time point might be related to errors at
another. If the errors are correlated, the model may be missing some important
time-dependent structure, leading to inefficient coefficient estimates and biased
standard errors.
 A common test for checking autocorrelation is the Durbin-Watson test, which
measures the correlation between residuals. A Durbin-Watson statistic near 2
indicates no autocorrelation, while values closer to 0 or 4 suggest positive or
negative autocorrelation, respectively.
• Example: Predicting stock prices based on historical data—if the errors are
correlated, future stock prices might depend on previous errors, violating this
assumption.
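As a quick illustration, here is a minimal sketch (using statsmodels, with made-up data) of how the Durbin-Watson statistic can be computed from a fitted model's residuals; the data and model are invented purely for demonstration.

# Minimal sketch: checking residual autocorrelation with the Durbin-Watson
# statistic from statsmodels; the data here is synthetic.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(size=200)          # errors are independent by construction

model = sm.OLS(y, sm.add_constant(x)).fit()
dw = durbin_watson(model.resid)
print(f"Durbin-Watson statistic: {dw:.2f}")  # a value near 2 suggests no autocorrelation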
 3. Homoscedasticity (Constant Variance of Errors)
• The variance of the residuals (errors) should remain constant across all
levels of the independent variable(s). This means the spread of residuals
should be consistent.
• Example: Predicting employee salaries based on experience—if the
variance of the residuals increases as experience increases, the model
violates this assumption.
 4. Normality of Errors
• The residuals should be normally distributed.
• Example: In predicting the weight of an object based on its volume, the
errors should follow a normal distribution for hypothesis tests and
confidence intervals on the regression coefficients to be valid.
 5. No Multicollinearity
• There should be little to no multicollinearity between the independent
variables. Multicollinearity occurs when independent variables are highly
correlated, making it difficult to estimate the effect of each variable.
• Example: In predicting house prices, if both square footage and number
of rooms are highly correlated, multicollinearity may exist.
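A common way to check this assumption is the Variance Inflation Factor (VIF). The sketch below assumes statsmodels is available and uses made-up housing-style features (sqft, rooms, age are illustrative names) to show how a strongly correlated pair of features produces a high VIF.

# Minimal sketch: detecting multicollinearity with Variance Inflation Factors
# (VIF); the feature names and data are illustrative only.
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(1)
sqft = rng.normal(1500, 300, size=500)
rooms = sqft / 500 + rng.normal(0, 0.1, size=500)   # deliberately tied to sqft
age = rng.normal(20, 5, size=500)

# Include a constant column, as is usual when computing VIFs for a regression design.
X = pd.DataFrame({"const": 1.0, "sqft": sqft, "rooms": rooms, "age": age})
for i, col in enumerate(X.columns):
    if col == "const":
        continue
    print(col, round(variance_inflation_factor(X.values, i), 1))
# VIF values well above roughly 5-10 flag a feature as highly collinear with the others.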
 Summary of Key Assumptions in Linear Regression:
1. Linearity: The relationship between the independent and dependent
variable should be linear.
2. Independence: The errors should be independent of each other.
3. Homoscedasticity: The variance of the errors should be constant
across all levels of the independent variables.
4. Normality: The errors should be normally distributed.
5. No Multicollinearity: The independent variables should not be highly
correlated.
Supervised Learning Models – Regression –
Model Building
 In supervised learning, regression is a technique used to model the
relationship between a dependent variable (response or target) and
one or more independent variables (predictors or features). The goal
of regression is to predict the value of the target variable based on the
input features.
 Types of Regression Models
1. Linear Regression: Models a linear relationship between the
dependent and independent variables.
2. Polynomial Regression: Extends linear regression by adding
polynomial terms to capture non-linear relationships.
3. Ridge and Lasso Regression: These are regularization techniques
that prevent overfitting by adding penalties on the size of coefficients.
 Steps for Building a Regression Model
1. Import Required Libraries
2. Data Preparation: Preprocess the data to handle missing values,
outliers, and categorical variables.
3. Splitting the Data: Split the data into training and testing sets.
4. Model Building: Train the regression model using the training data.
5. Model Evaluation: Evaluate the model on test data using performance
metrics like Mean Squared Error (MSE), R-squared, etc.
6. Visualization: Visualize the results (e.g., the regression line, residuals).
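A minimal sketch of these steps using scikit-learn on a synthetic dataset (the dataset, split ratio, and metric choices are illustrative, not prescriptive):

# Minimal sketch of the regression model-building steps with scikit-learn.
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Steps 1-3: prepare data and split into training and testing sets
X, y = make_regression(n_samples=500, n_features=3, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 4: model building
model = LinearRegression().fit(X_train, y_train)

# Step 5: model evaluation on the test set
y_pred = model.predict(X_test)
print("MSE:", mean_squared_error(y_test, y_pred))
print("R-squared:", r2_score(y_test, y_pred))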
Supervised Learning Models – Regression –
Inference
 Inference in regression analysis involves making predictions about the
dependent variable (outcome) based on the independent variables
(predictors) and estimating the relationship between them. This process often
includes examining the model's coefficients, confidence intervals, p-
values, and the overall goodness of fit.
 Key Concepts in Regression Inference
1. Model Coefficients: These represent the estimated change in the
response variable for a one-unit change in the predictor variable,
holding all other variables constant.
2. Hypothesis Testing: Involves testing if the coefficients are significantly
different from zero using p-values. A low p-value (< 0.05) indicates
that you can reject the null hypothesis that the coefficient is equal to
zero.
3. Confidence Intervals: These provide a range of values for the
coefficients that likely contain the true parameter values.
4. Goodness of Fit: Metrics like R-squared and Adjusted R-squared indicate
how well the model explains the variability of the target variable.
 The walkthrough below applies these ideas to the California housing dataset:
1. Data Loading and Preparation:
1. The California housing dataset is loaded, and features and target variables are
extracted.
2. The dataset is split into training and testing sets. A constant is added to the training
set for OLS regression.
2. Linear Regression:
1. We fit a linear regression model using OLS and display the results, including
coefficients, p-values, R-squared, and adjusted R-squared values.
2. We visualize the predicted vs. actual values.
3. Ridge Regression:
1. We fit a Ridge regression model and display the coefficients. Ridge regression
applies L2 regularization, which helps to reduce overfitting by penalizing large
coefficients.
2. The predicted vs. actual values are visualized.
4. Lasso Regression:
1. A Lasso regression model is fit, and the coefficients are displayed. Lasso regression
applies L1 regularization, which can shrink some coefficients to zero, effectively
performing feature selection.
2. The predicted vs. actual values are visualized.
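A condensed sketch of this walkthrough, assuming statsmodels for OLS and scikit-learn for Ridge and Lasso; the regularization strengths are illustrative values, not tuned.

# Sketch of the walkthrough above: OLS, Ridge, and Lasso on the California
# housing dataset.
import statsmodels.api as sm
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge, Lasso

data = fetch_california_housing(as_frame=True)
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# OLS: add a constant for the intercept, then inspect coefficients, p-values,
# R-squared, and adjusted R-squared in the summary.
ols = sm.OLS(y_train, sm.add_constant(X_train)).fit()
print(ols.summary())

# Ridge (L2) and Lasso (L1) regularization; Lasso may shrink some coefficients to zero.
ridge = Ridge(alpha=1.0).fit(X_train, y_train)
lasso = Lasso(alpha=0.01, max_iter=10000).fit(X_train, y_train)
print("Ridge coefficients:", ridge.coef_)
print("Lasso coefficients:", lasso.coef_)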
 Inference Results
• Coefficients: Each model will provide coefficients that represent the
estimated effect of each feature on the target variable. For Lasso, some
coefficients may be zero, indicating that those features are not
influential in predicting the target.
• P-values (in OLS): The p-values will indicate whether the coefficients
are statistically significant. Generally, a p-value < 0.05 suggests that the
predictor is significant.
• Goodness of Fit: The R-squared and adjusted R-squared values from
the OLS summary indicate how well the model explains the variability of
the target variable.
 Understanding the Output
1. Coefficients: The output will show the estimated coefficients for each
feature. For instance, if the coefficient for a feature is positive, it
suggests a positive relationship with the target variable.
2. P-values: If a p-value is less than 0.05, it suggests that the
corresponding predictor is statistically significant.
3. R-squared Value: Indicates how well the model explains the variability
of the response variable. An R-squared value close to 1 suggests a good
fit.
4. Confidence Intervals: Typically provided in the summary output, these
intervals give a range in which we expect the true coefficient values to
lie.
Supervised Learning Models –
Classification - Algorithms
 Supervised Learning models, especially classification algorithms, are
used when the target variable is discrete, meaning it falls into distinct
categories (like "spam" or "not spam"). These algorithms learn a
mapping from input features to a target label using a set of labeled
training data. The goal is to generalize this learned mapping to classify
unseen data correctly.
 Key Concepts in Classification:
1. Training: Learning a model by providing it with labeled examples.
2. Testing: Evaluating the model on unseen data to assess its
generalization ability.
3. Evaluation Metrics: Accuracy, precision, recall, F1-score, confusion
matrix, etc., are used to measure model performance.
Classification Algorithms:
1. Logistic Regression: A linear model for binary classification (can be
extended to multiclass problems).
2. K-Nearest Neighbors (KNN): A non-parametric model that classifies
based on the majority class of its nearest neighbors.
3. Support Vector Machines (SVM): A classifier that tries to find the
optimal hyperplane separating classes.
4. Decision Trees: A model that partitions data by making decisions at
each node.
5. Random Forest: An ensemble of decision trees that reduces overfitting
and improves accuracy.
6. Naive Bayes: A probabilistic model based on Bayes' Theorem,
assuming feature independence.
1. Logistic Regression
 Logistic Regression is a linear classifier used for binary or multiclass
classification. It predicts the probability that a data point belongs to a
certain class using the logistic function (sigmoid), which outputs values
between 0 and 1.
 Formula: p = 1 / (1 + e^(−wᵀx))
 Where p is the probability, w is the weight vector, and x is the input
vector (a bias/intercept term is usually absorbed into w).
• Useful for binary classification problems like spam detection, disease
diagnosis, and credit scoring.
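A minimal scikit-learn sketch, using the breast cancer dataset purely as an illustrative binary-classification example (scaling and the default solver are arbitrary choices here):

# Minimal sketch: logistic regression for binary classification.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_train, y_train)
proba = clf.predict_proba(X_test)[:, 1]    # sigmoid outputs in (0, 1)
print("Accuracy:", accuracy_score(y_test, clf.predict(X_test)))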
2. K-Nearest Neighbors (KNN)
 KNN is a simple, non-parametric, instance-based learning algorithm. It
classifies a new data point based on the majority class of its nearest k
neighbors. It doesn't build an explicit model during training; instead, it
stores the training data and computes distances during prediction.
 Distance metric: Typically, Euclidean distance is used.
 Choosing k: The optimal number of neighbors can be found through
cross-validation.

• Suitable for small datasets and pattern recognition problems like
image classification or handwriting recognition.
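A minimal sketch of KNN with k chosen by cross-validation; the digits dataset and the candidate k values are illustrative choices.

# Minimal sketch: KNN classification with k selected by 5-fold cross-validation.
from sklearn.datasets import load_digits
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

X, y = load_digits(return_X_y=True)
for k in (1, 3, 5, 7):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    print(f"k={k}: mean CV accuracy = {scores.mean():.3f}")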
3. Support Vector Machine (SVM)

 SVM finds a hyperplane that maximizes the margin between two classes. It is a
powerful classifier, particularly for high-dimensional data. SVM can be extended to
nonlinear classification using kernel tricks (e.g., the radial basis function (RBF) kernel).
• Linear SVM: For linearly separable data.
• Kernel SVM: For non-linear data, using transformations like polynomial
or RBF kernels.

• Suitable for high-dimensional datasets like text classification or bioinformatics.
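A minimal sketch comparing a linear and an RBF-kernel SVM; the dataset is illustrative, and features are standardized first, which SVMs generally benefit from.

# Minimal sketch: linear vs. RBF-kernel SVM with standardized features.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

for kernel in ("linear", "rbf"):
    clf = make_pipeline(StandardScaler(), SVC(kernel=kernel)).fit(X_train, y_train)
    print(kernel, "accuracy:", clf.score(X_test, y_test))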
4. Decision Tree
 A decision tree is a tree-like model of decisions. It splits the data
recursively by choosing features that result in the best separation of
classes. The decision tree can be prone to overfitting but is interpretable
and easy to visualize.
• Useful for interpretable models like credit scoring, medical
diagnosis, and customer churn prediction.
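A minimal sketch of an interpretable, depth-limited decision tree whose learned rules are printed as text; the dataset and max_depth value are illustrative choices.

# Minimal sketch: a shallow decision tree printed as readable if/else rules.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(iris.data, iris.target)
print(export_text(tree, feature_names=list(iris.feature_names)))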
5. Random Forest
 Random Forest is an ensemble method that builds multiple decision
trees on random subsets of data and features. It aggregates their
predictions to improve accuracy and reduce overfitting.
• Bootstrapping: Random sampling of the dataset to train each tree.
• Feature randomness: At each split, a random subset of features is
considered, reducing correlation between trees.
• Commonly used in finance, banking, and healthcare for tasks like
fraud detection or customer segmentation.
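A minimal random forest sketch with impurity-based feature importances; the dataset and the n_estimators value are illustrative.

# Minimal sketch: random forest with feature importances aggregated over trees.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
print("Test accuracy:", rf.score(X_test, y_test))
# Top-5 features by impurity-based importance
for name, imp in sorted(zip(data.feature_names, rf.feature_importances_),
                        key=lambda t: -t[1])[:5]:
    print(f"{name}: {imp:.3f}")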
6. Naive Bayes

 Naive Bayes is a simple probabilistic classifier based on Bayes' Theorem. It
assumes independence between features (which is often not true, hence "naive").
Despite this assumption, Naive Bayes often performs surprisingly well in many domains.
• Formula: P(y | x) = P(x | y) · P(y) / P(x)
• Common types: Gaussian Naive Bayes (for continuous data),
Multinomial Naive Bayes (for discrete counts), and Bernoulli Naive Bayes
(for binary features).
• Well-suited for text classification tasks such as spam filtering and sentiment analysis.
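A minimal sketch of Multinomial Naive Bayes for text classification on a tiny made-up corpus; the example sentences and labels are invented for illustration.

# Minimal sketch: Multinomial Naive Bayes on bag-of-words text features.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["win a free prize now", "limited offer click here",
         "meeting agenda for monday", "lunch with the project team"]
labels = ["spam", "spam", "ham", "ham"]

clf = make_pipeline(CountVectorizer(), MultinomialNB()).fit(texts, labels)
print(clf.predict(["free prize meeting"]))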
 Summary:
• Logistic Regression: Simple linear classifier for binary/multiclass classification.
• KNN: Instance-based learning, effective in small datasets.
• SVM: Finds the optimal hyperplane, effective in high-dimensional data.
• Decision Trees: Intuitive, interpretable models prone to overfitting.
• Random Forest: Ensemble of decision trees, reduces overfitting and improves
accuracy.
• Naive Bayes: Probabilistic model assuming feature independence, effective for
text data.
 Each classification algorithm has its strengths and weaknesses, and the choice of
algorithm depends on the problem's nature, the data size, and the need for
interpretability vs. accuracy.
Machine Learning Evaluation Metrics
ML Evaluation Metrics Are…
 tied to specific machine learning tasks
 methods that determine an algorithm’s performance and behavior
 helpful for deciding which model best meets the target performance
 helpful for parameterizing the model so that it offers the best-performing algorithm
Evaluation Metrics Types...
 Various types of ML Algorithms (classification, regression, ranking, clustering)
 Different types of evaluation metrics for different types of algorithm
 Some metrics can be useful for more than one type of algorithm (Precision - Recall)
 Will cover Evaluation Metrics for Supervised learning models only ( Classification,
Regression, Ranking)
Classification Metrics
Classification Model Does...
 Predict class labels given input data
 In Binary classification, there are two possible output classes ( 0 or 1, True or False,
Positive or Negative, Yes or No etc.)
 Spam detection of email is a good example of Binary classification.
Some Popular Classification Metrics...
 Accuracy
 Confusion Matrix
 Log-Loss
 AUC
Accuracy
 Ratio between the number of correct predictions and the total number of predictions:
Accuracy = (number of correct predictions) / (total number of predictions)
 Example: Suppose we have 100 examples in the positive class and 200 examples in the
negative class. Our model declares 80 out of 100 positives as positive correctly and 195
out of 200 negatives as negative correctly.
 So, accuracy is = (80 + 195)/(100 + 200) = 91.7%
Confusion Matrix
 Shows a more detailed breakdown of correct and incorrect classifications for
each class.
 For our previous example, the confusion matrix looks like:

                       Predicted as positive   Predicted as negative
 Labeled as positive   80                      20
 Labeled as negative   5                       195

 What is the accuracy for the positive class? And for the negative class?
 Clearly, the positive class has lower accuracy (80/100 = 80%) than the negative class (195/200 = 97.5%)
 That information is lost if we calculate only the overall accuracy.


Per-Class Accuracy
 Average per-class accuracy for the previous example:
(80% + 97.5%)/2 = 88.75%, which differs from the overall accuracy of 91.7%
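A small sketch showing how the overall and per-class accuracies above can be computed directly from the confusion matrix (the numbers are copied from the example).

# Minimal sketch: overall and per-class accuracy from a confusion matrix.
import numpy as np

# rows: actual (positive, negative); columns: predicted (positive, negative)
cm = np.array([[80, 20],
               [5, 195]])

overall = cm.trace() / cm.sum()
per_class = cm.diagonal() / cm.sum(axis=1)
print("Overall accuracy:", overall)                      # ~0.917
print("Per-class accuracy:", per_class)                  # [0.80, 0.975]
print("Average per-class accuracy:", per_class.mean())   # 0.8875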
Log-Loss
 Very much useful when the raw output of classifier is a numeric probability instead of a
class label 0 or 1
 Mathematically, log-loss for a binary classifier is:
log-loss = −(1/N) · Σᵢ [ yᵢ · log(pᵢ) + (1 − yᵢ) · log(1 − pᵢ) ]
where yᵢ is the true label (0 or 1) and pᵢ is the predicted probability of class 1.
 The minimum is 0, reached when the prediction and the true label match exactly.
 Example: for a data point whose true class is 1, a predicted probability of 0.51 gives a
log-loss of −log(0.51) ≈ 0.67, while a predicted probability of 1 gives a log-loss of 0.
 Minimizing this value maximizes the quality of the classifier's probability estimates.
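A small sketch of the log-loss calculation, using scikit-learn's log_loss as one possible implementation; the probabilities are the ones from the example above.

# Minimal sketch: log-loss for individual predictions (true label is 1 in both cases).
import numpy as np
from sklearn.metrics import log_loss

print(log_loss([1], [0.51], labels=[0, 1]))   # ~0.673: barely better than chance
print(log_loss([1], [0.99], labels=[0, 1]))   # ~0.010: confident and correct
print(-np.log(0.51))                          # the same value computed by hand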
AUC (Area Under Curve)
 The curve is receiver operating
characteristic curve or in short ROC
curve
 Provides nuanced details about the
behavior of the classifier
 Bad ROC curve covers very little area
 Good ROC curve has a lot of space
under it
Ranking Metrics
Ranking ...
 Is related to binary classification
 Internet Search can be a good example which acts as a ranker.
 During a query, it returns ranked list of web pages relevant to that query
 So, ranking can be seen as a binary classification of each result as “relevant” or “irrelevant” to the query
 It also orders the results so that the most relevant ones appear at the top
 So, what can be done in the underlying implementation to handle both?
 Can we predict what ranking metrics will evaluate, and how?
Some Ranking Metrics..
 Precision - Recall
 Precision - Recall Curve and F1 Score
Precision - Recall
Considering the scenario of web search result, Precision answers this question:
“Out of the items that the ranker/classifier predicted to be relevant, how many are truly
relevant?”
Whereas, Recall answers this:
“Out of all the items that are truly relevant, how many are found by the ranker/classifier?”
Calculation Example Of Precision- Recall
Precision = TP / (TP+FP)
= 60 / (60 + 140) = 30%
Recall = TP / (TP+FN)
= 60 / (60+40) = 60%
                   Predicted as Negative   Predicted as Positive
Actual Negative    9760 (TN)               140 (FP)
Actual Positive    40 (FN)                 60 (TP)

Total Negative = 9760 + 140 = 9900
Total Positive = 40 + 60 = 100
Total Negative predictions = 9760 + 40 = 9800
Total Positive predictions = 140 + 60 = 200
Precision - Recall Curve
Trade-off between Recall and Precision
F-Measure
 One measure of performance that takes into account both recall and precision
 Harmonic mean of recall and precision:
F1 = 2 · (precision · recall) / (precision + recall)
 Compared to the arithmetic mean, both values need to be high for the harmonic mean to be high
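A small sketch computing precision, recall, and the F-measure from the counts in the earlier example (TP = 60, FP = 140, FN = 40):

# Minimal sketch: precision, recall, and F1 from raw confusion-matrix counts.
tp, fp, fn = 60, 140, 40

precision = tp / (tp + fp)            # 0.30
recall = tp / (tp + fn)               # 0.60
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, round(f1, 2))   # 0.3 0.6 0.4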
Regression Metrics
What Regression Tasks do?
 Model learns to predict numeric scores.
 For example, we try to predict the price of a stock on future days given past price history
and other useful information
Some Regression Metrics..
 RMSE (Root Mean Square Error)
 Quantiles of Errors
RMSE
 The most commonly used metric for regression tasks
 Also known as RMSD ( root-mean-square deviation)
 This is defined as the square root of the average squared distance between the actual
score and the predicted score:
RMSE = sqrt( (1/n) · Σᵢ (yᵢ − ŷᵢ)² )
Quantiles of Errors
 RMSE is an average, so it is sensitive to large outliers.
 If the regressor performs really badly on even a single data point, the average error can
become large, so RMSE is not robust; quantiles of the error distribution (e.g., the median
absolute error) are more robust summaries.
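A small sketch contrasting RMSE with a quantile-based summary (the median absolute error) on made-up predictions that contain one large outlier:

# Minimal sketch: RMSE vs. median absolute error with a single bad prediction.
import numpy as np

y_true = np.array([10.0, 12.0, 11.0, 13.0, 12.0])
y_pred = np.array([10.5, 11.5, 11.0, 13.5, 30.0])   # last prediction is way off

errors = y_true - y_pred
rmse = np.sqrt(np.mean(errors ** 2))
median_abs_err = np.median(np.abs(errors))           # 50th percentile of |error|
print(f"RMSE = {rmse:.2f}, median absolute error = {median_abs_err:.2f}")
# The single outlier dominates RMSE but barely moves the median.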
Clustering in Machine Learning

 Clustering, or cluster analysis, is a machine learning technique that groups an unlabelled
dataset. It can be defined as "a way of grouping data points into different clusters
consisting of similar data points; objects with possible similarities remain in a group that
has few or no similarities with other groups."

 It does this by finding similar patterns in the unlabelled dataset, such as shape, size, color,
behavior, etc., and dividing the data according to the presence or absence of those patterns.

 It is an unsupervised learning method, so no supervision is provided to the algorithm, and
it deals with an unlabelled dataset.

 After applying the clustering technique, each cluster or group is assigned a cluster ID. An
ML system can use this ID to simplify the processing of large and complex datasets.
 The clustering technique is commonly used for statistical data analysis.

 Example: Let's understand the clustering technique with the real-world example of a
shopping mall, where similar kinds of items are grouped together in sections.
 The clustering technique can be widely used in various tasks. Some most common uses of
this technique are:
• Market Segmentation
• Statistical data analysis
• Social network analysis
• Image segmentation
• Anomaly detection, etc.

 Apart from these general uses, Amazon uses clustering in its recommendation system to
provide recommendations based on past product searches. Netflix also uses this technique
to recommend movies and web series to its users based on their watch history.
Types of Clustering Methods

 Clustering methods are broadly divided into hard clustering (each data point belongs to
only one group) and soft clustering (a data point can belong to more than one group).
Several other approaches also exist. Below are the main clustering methods used in
machine learning:
1. Partitioning Clustering
2. Density-Based Clustering
3. Distribution Model-Based Clustering
4. Hierarchical Clustering
5. Fuzzy Clustering
Partitioning Clustering

 It is a type of clustering that divides the data into non-hierarchical groups. It is also known as
the centroid-based method. The most common example of partitioning clustering is the
K-Means Clustering algorithm.
 In this type, the dataset is divided into a set of k groups, where k is the pre-defined
number of groups. Cluster centers are chosen so that each data point is closer to its own
cluster's centroid than to any other cluster's centroid.
Density-Based Clustering

 The density-based clustering method connects highly dense areas into clusters, so
arbitrarily shaped clusters can be formed as long as the dense regions can be connected.
The algorithm identifies regions of high density in the data space and joins them into
clusters, with the dense areas separated from each other by sparser areas.
 These algorithms can struggle when the dataset has clusters of varying density or many
dimensions.
Distribution Model-Based Clustering
 In the distribution model-based clustering method, the data is divided based on the
probability that each data point belongs to a particular distribution. The grouping is done
by assuming some underlying distribution, most commonly the Gaussian distribution.
 The example of this type is the Expectation-Maximization Clustering algorithm that
uses Gaussian Mixture Models (GMM)
Hierarchical Clustering

 Hierarchical clustering can be used as an alternative to partitioning clustering, as it does
not require pre-specifying the number of clusters. In this technique, the dataset is divided
into clusters that form a tree-like structure called a dendrogram. Any number of clusters
can then be selected by cutting the tree at the appropriate level. The most common
example of this method is the Agglomerative Hierarchical algorithm.
Fuzzy Clustering

 Fuzzy clustering is a type of soft clustering in which a data object may belong to more
than one group or cluster. Each data point has a set of membership coefficients that
indicate its degree of membership in each cluster. The Fuzzy C-Means algorithm is an
example of this type of clustering; it is sometimes also known as the Fuzzy K-Means algorithm.
Clustering Algorithms

 Clustering algorithms can be grouped by the models explained above, and the choice of
algorithm depends on the kind of data we are using. For example, some algorithms need
the number of clusters specified in advance, whereas others work by finding minimum
distances between observations in the dataset.
1. K-Means algorithm: The K-Means algorithm is one of the most popular clustering
algorithms. It partitions the samples into clusters of roughly equal variance, and the
number of clusters must be specified in advance. It is fast and requires relatively few
computations, with complexity roughly linear in the number of samples, O(n).
2. DBSCAN Algorithm: It stands for Density-Based Spatial Clustering of Applications
with Noise. It is an example of a density-based model similar to the mean-shift, but with
some remarkable advantages. In this algorithm, the areas of high density are separated by
the areas of low density. Because of this, the clusters can be found in any arbitrary shape.
3. Expectation-Maximization Clustering using GMM: This algorithm can be used as an
alternative to K-Means, or in cases where K-Means may fail. In GMM, it is assumed that
the data points are Gaussian distributed.
4. Agglomerative Hierarchical algorithm: The agglomerative hierarchical algorithm
performs bottom-up hierarchical clustering. Each data point is treated as a single cluster
at the outset, and clusters are then successively merged. The cluster hierarchy can be
represented as a tree structure.
Applications of Clustering

 Below are some commonly known applications of clustering technique in Machine Learning:
• In Identification of Cancer Cells: The clustering algorithms are widely used for the
identification of cancerous cells. It divides the cancerous and non-cancerous data sets into
different groups.
• In Search Engines: Search engines also make use of the clustering technique. Search
results are returned based on the objects closest to the search query, which is done by
grouping similar data objects together, far from dissimilar objects. The accuracy of a
query's results depends on the quality of the clustering algorithm used.
• Customer Segmentation: It is used in market research to segment the customers based on
their choice and preferences.
• In Biology: It is used in the biology stream to classify different species of plants and animals
using the image recognition technique.
• In Land Use: The clustering technique is used to identify areas of similar land use in a
GIS database. This can be very useful for determining the purpose for which a particular
piece of land is most suitable.
Unsupervised Learning Models – Clustering – Algorithms

 Clustering: Grouping similar data points together based on their features.


 Similarity/Dissimilarity: Clustering algorithms are based on a similarity or dissimilarity
measure, often the distance between data points (e.g., Euclidean distance).
Types of Unsupervised Clustering Algorithms:

1. K-Means Clustering
2. Hierarchical Clustering
3. DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
4. Gaussian Mixture Models (GMM)
5. Agglomerative Clustering
1. K-Means Clustering

 K-Means is one of the most commonly used clustering algorithms. It partitions data into k
clusters, where each data point belongs to the cluster with the nearest centroid.
 Steps:
1. Choose k centroids randomly.
2. Assign each data point to the nearest centroid.
3. Update the centroids by calculating the mean of all data points in each cluster.
4. Repeat until centroids no longer move or a set number of iterations is reached.
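A minimal K-Means sketch with scikit-learn on synthetic blob data; k = 3 is chosen to match how the blobs were generated.

# Minimal sketch: K-Means on synthetic blob data.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print("Centroids:\n", km.cluster_centers_)
print("First 10 cluster labels:", km.labels_[:10])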
2. Hierarchical Clustering

 Hierarchical clustering builds a tree of clusters. There are two types:


• Agglomerative (Bottom-Up): Starts with each point as its own cluster and merges clusters.
• Divisive (Top-Down): Starts with one cluster and recursively splits it.
 A dendrogram is a common visualization used for hierarchical clustering.
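A minimal sketch of agglomerative (bottom-up) clustering and its dendrogram using SciPy's hierarchy module; the synthetic data and the cut into 3 clusters are illustrative choices.

# Minimal sketch: agglomerative clustering with a dendrogram (SciPy).
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=30, centers=3, random_state=0)
Z = linkage(X, method="ward")                      # merge history of the clusters
labels = fcluster(Z, t=3, criterion="maxclust")    # cut the tree into 3 clusters
print("Cluster labels:", labels)

dendrogram(Z)
plt.title("Dendrogram")
plt.show()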
3. DBSCAN (Density-Based Spatial Clustering)

 DBSCAN is a density-based clustering algorithm that forms clusters based on regions of high
density. It can find clusters of arbitrary shapes and handle noise (outliers).
 Parameters:
• eps: Maximum distance between two points to be considered neighbors.
• min_samples: Minimum number of points required to form a dense region.
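A minimal DBSCAN sketch on two concentric circles, a shape K-Means cannot separate; the eps and min_samples values are illustrative and would normally need tuning.

# Minimal sketch: DBSCAN finding arbitrarily shaped clusters and noise.
from sklearn.datasets import make_circles
from sklearn.cluster import DBSCAN

X, _ = make_circles(n_samples=300, factor=0.4, noise=0.05, random_state=0)
db = DBSCAN(eps=0.2, min_samples=5).fit(X)
# Label -1 marks points DBSCAN treats as noise (outliers).
print("Cluster labels found:", set(db.labels_))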
4. Gaussian Mixture Models (GMM)

 GMM assumes that the data is generated from a mixture of several Gaussian distributions.
Unlike K-Means, GMM can model clusters of different shapes and sizes because it considers
both the mean and the covariance of the data.
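A minimal GMM sketch with full covariance matrices, so each component can have its own shape and spread; the synthetic dataset and number of components are illustrative.

# Minimal sketch: Gaussian Mixture Model with soft (probabilistic) assignments.
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=[0.5, 1.5, 3.0], random_state=0)
gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0).fit(X)
print("Component means:\n", gmm.means_)
print("Soft assignment of the first point:", gmm.predict_proba(X[:1]).round(3))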
Summary of Clustering Algorithms:

Algorithm      Strengths                                           Weaknesses
K-Means        Simple, fast, efficient for spherical clusters      Sensitive to initial centroids, cannot handle non-spherical clusters
Hierarchical   Creates a hierarchy of clusters, visualizable (dendrogram)   Difficult for large datasets, does not scale well
DBSCAN         Handles noise, finds arbitrarily shaped clusters    Requires parameter tuning (eps, min_samples)
GMM            Models clusters with different shapes/sizes         Computationally expensive, needs prior knowledge of number of clusters
Time-Series Forecasting
 Time-series forecasting is the process of predicting future values based on previously
observed values. Time-series data consists of observations collected at regular intervals
(daily, monthly, yearly, etc.), and the goal is to analyze the data trends, seasonality, and
other components to forecast future values.
 Key Characteristics of Time-Series Data:
1. Trend: The general direction in which the data is moving (upward, downward, or flat).
2. Seasonality: Recurring patterns or cycles in data at regular intervals (e.g., quarterly or
yearly).
3. Noise: Random fluctuations in data that do not follow any pattern.
4. Autocorrelation: When current values are influenced by past values.
Steps in Time-Series Forecasting:

1. Data Collection: Obtain a time-series dataset.


2. Data Preprocessing: Handle missing data, detect and remove outliers, scale the data, etc.
3. Exploratory Data Analysis (EDA): Visualize trends, seasonality, and other patterns.
4. Train/Test Split: Divide the data into training and testing sets.
5. Modeling: Build models to learn from the training data and make predictions.
6. Evaluation: Measure model performance on the test set using metrics like Mean Squared
Error (MSE), Mean Absolute Error (MAE), etc.
7. Prediction: Forecast future values using the trained model.
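A minimal sketch of this workflow on a synthetic monthly series, using an ARIMA model from statsmodels as one possible modeling choice; the series, the ARIMA order, and the 12-month holdout are all illustrative.

# Minimal sketch: time-series forecasting workflow on a toy monthly series.
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA
from sklearn.metrics import mean_absolute_error

# Steps 1-2: build a toy monthly series with trend, yearly seasonality, and noise
rng = np.random.default_rng(0)
idx = pd.date_range("2015-01-01", periods=96, freq="MS")
values = 0.5 * np.arange(96) + 10 * np.sin(2 * np.pi * np.arange(96) / 12) + rng.normal(0, 2, 96)
series = pd.Series(values, index=idx)

# Step 4: train/test split - hold out the last 12 months
train, test = series[:-12], series[-12:]

# Steps 5 and 7: fit the model and forecast the held-out year
model = ARIMA(train, order=(2, 1, 2)).fit()
forecast = model.forecast(steps=12)

# Step 6: evaluate on the held-out data
print("MAE on held-out year:", mean_absolute_error(test, forecast))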
