SemVII_MachineLearning
Mod. 1
Q1. What are the Steps to Design a Machine Learning Problem ?
OR Explain the Procedure for Designing a Machine Learning System.
OR Steps in developing a ML application.
Q2. What are the issues and challenges in Machine Learning ? [V IMP]
Q.3 Explain the steps required for selecting the right Machine Learning algorithm [V IMP]
Q.8 Explain any five performance measures along with examples or Performance metrics
for Classification [V IMP]
OR Explain performance evaluation metrics for binary classification with a suitable
example.
- Accuracy: Accuracy measures the proportion of correctly predicted instances (both true
positives and true negatives) out of the total number of instances.
Accuracy = (TP+TN) / Total
Example: Suppose a model predicts whether an email is spam or not, and out of 100
emails, it correctly classifies 85 (both spam and not spam). The accuracy would be
85/100 = 0.85 or 85%
- Precision: Precision indicates the proportion of positive predictions that are actually
correct. It answers the question, “Of all items predicted as positive, how many are truly
positive?”
Precision = TP/(TP+FP)
Example: In a medical diagnosis, if a model predicts cancer, precision tells us how many
of those predicted cases are actually cancerous. If the model predicted cancer 20 times,
but only 15 were correct, precision would be 15/20 = 0.75 or 75%.
- Recall: Recall measures the proportion of actual positives that were correctly identified
by the model. It answers, “Of all items that are truly positive, how many did we correctly
identify?”
Recall = TP / (TP + FN)
Example: If there are 25 fraudulent transactions, and the model identifies 20, recall
would be 20/25 = 0.8 or 80%.
- F1 Score: The F1 score is the harmonic mean of precision and recall, offering a balance
between them. It is useful when you need to consider both false positives and false negatives
and is particularly helpful in cases of imbalanced datasets.
F1 = 2 × (Precision × Recall) / (Precision + Recall)
Example: Using the precision of 0.75 and recall of 0.8 from the examples above,
F1 = 2 × (0.75 × 0.8) / (0.75 + 0.8) ≈ 0.77 or 77%.
- AUC: AUC represents the area under the ROC curve, which plots the True Positive Rate
(TPR) against the False Positive Rate (FPR) across various threshold levels. AUC
values range from 0 to 1, with 1 indicating perfect classification.
Example: In binary classification, if a model has an AUC of 0.95, it’s generally
considered very good, meaning the model has a high true positive rate and low false
positive rate across different thresholds.
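A minimal sketch of computing these metrics, assuming scikit-learn is available; the labels and probability scores below are purely illustrative (1 = positive class, 0 = negative class).

```python
# Illustrative sketch: the five classification metrics described above.
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, roc_auc_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]                        # ground truth
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]                        # hard predictions
y_prob = [0.9, 0.2, 0.8, 0.4, 0.1, 0.7, 0.6, 0.3, 0.95, 0.05]  # scores for AUC

print("Accuracy :", accuracy_score(y_true, y_pred))   # (TP + TN) / Total
print("Precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("Recall   :", recall_score(y_true, y_pred))     # TP / (TP + FN)
print("F1 score :", f1_score(y_true, y_pred))         # harmonic mean of P and R
print("AUC      :", roc_auc_score(y_true, y_prob))    # area under the ROC curve
```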
Q.9 Explain Regression line, Scatter plot, Error in prediction and Best fitting line.
Regression Line: The regression line is a straight line that represents the relationship between
two variables in a simple linear regression. This line is drawn through a set of data points in
such a way that it minimizes the total distance between itself and the actual data points. The
regression line is used to predict the dependent variable (Y) based on the value of the
independent variable (X).
Equation for a simple linear regression line is Y = mX + c, where:
Y is the Predicted value
m is the Slope of the line
c is the Y-intercept
Scatter Plot: A scatter plot is a graphical representation of individual data points, where each
point represents a pair of values for the variables X(independent) and Y(dependent). This allows
you to visually assess the relationship between the two variables.
Scatter plots help to visualize whether there is a trend or correlation between X and Y. For
instance, in a dataset of house prices (Y) and house size (X), a scatter plot could show a
positive correlation if larger houses tend to be more expensive.
Error in Prediction: The error in prediction, also known as residual in regression, is the
difference between the actual observed value and the predicted value on the regression line.
This error shows how far off a model's prediction is from the actual observed value. The aim in
regression is to minimize these errors across all data points to make the predictions more
accurate.
Error (Residual) = Y observed − Y predicted
Example: If the actual sales of a product (Y) are 100 units and the model predicts 90 units,
then the error is 100 − 90 = 10.
Best Fitting Line: The best fitting line, also known as the line of best fit, is the regression line
that minimizes the sum of squared errors (the differences between actual and predicted values)
across all data points, ensuring that the overall error is as low as possible. The best fitting line
provides the most accurate representation of the linear relationship between X and Y.
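A minimal sketch tying these terms together, assuming NumPy is available; the data points are illustrative. It fits the best-fitting line Y = mX + c by least squares and prints the residuals (errors in prediction).

```python
# Illustrative sketch: regression line, residuals and sum of squared errors.
import numpy as np

X = np.array([1, 2, 3, 4, 5], dtype=float)   # independent variable
Y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])      # dependent variable

m, c = np.polyfit(X, Y, deg=1)               # slope and intercept of the best-fit line
Y_pred = m * X + c                           # points on the regression line
residuals = Y - Y_pred                       # error in prediction for each point

print("Slope m:", m, "Intercept c:", c)
print("Residuals:", residuals)
print("Sum of squared errors:", np.sum(residuals ** 2))
```

Plotting X against Y with a scatter plot and drawing Y_pred over it would show the regression line passing through the cloud of points.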
Q.10 Explain the terms overfitting, underfitting, bias & variance tradeoff with respect to
Machine Learning.
Decision Boundary: By setting a threshold, such as 0.5, we can classify instances into classes.
If the probability is greater than 0.5, the model assigns the instance to class 1; otherwise, it
assigns it to class 0.
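A minimal sketch of this thresholding step, assuming NumPy; the predicted probabilities are illustrative values.

```python
# Illustrative sketch: converting predicted probabilities into class labels at a 0.5 threshold.
import numpy as np

probabilities = np.array([0.92, 0.35, 0.51, 0.08, 0.77])
predicted_class = (probabilities > 0.5).astype(int)   # 1 if p > 0.5, else 0
print(predicted_class)                                # [1 0 1 0 1]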
Applications: Logistic regression is widely used in various domains for classification tasks,
including
Medical Diagnosis: Predicting the presence or absence of a disease
Marketing: Predicting whether a customer will purchase a product
Finance: Assessing the likelihood of loan default
Logistic Regression vs. Support Vector Machine:
- Logistic Regression is preferred for simple and linearly separable problems.
- Support Vector Machine is preferred when there is a need for complex decision boundaries.
Q.14 Explain the different ways to combine the classifier [VV IMP]
Ensemble Learning leverages multiple individual models, or "weak learners," to create a
stronger combined model that performs better than any single classifier alone. The idea is that
by combining different models, we can reduce errors and variance, leading to more accurate
and stable predictions.
Here are the main ways to combine classifiers under ensemble learning:
Bagging: Bagging reduces variance by training multiple models on different random subsets of
the training data (with replacement) and then averaging their predictions (or taking a majority
vote). Each model, often a decision tree, is trained on a random sample of the data. Once all
models are trained, predictions from each model are aggregated.
Bagging is useful in reducing overfitting and variance, especially in high-variance models like
decision trees.
Example: Random Forest
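A minimal sketch of bagging, assuming scikit-learn is available and using its built-in Iris dataset for illustration; BaggingClassifier averages many decision trees trained on bootstrap samples, and Random Forest is the well-known bagging-style ensemble.

```python
# Illustrative sketch: bagging decision trees vs. a Random Forest.
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50)  # bootstrap samples
forest = RandomForestClassifier(n_estimators=50)                        # bagging + random features

print("Bagging accuracy      :", cross_val_score(bagging, X, y, cv=5).mean())
print("Random Forest accuracy:", cross_val_score(forest, X, y, cv=5).mean())
```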
Boosting: Boosting aims to convert weak learners into strong learners by sequentially training
models, with each model attempting to correct the errors of the previous one. Each new model
focuses on the samples that were previously misclassified. The final prediction is a weighted
average (or weighted vote) of the predictions.
Boosting is effective for reducing both bias and variance, creating a more accurate model
Example: AdaBoost and Gradient Boosting
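A minimal sketch of boosting, assuming scikit-learn and using its breast-cancer dataset as an illustrative example.

```python
# Illustrative sketch: two boosting ensembles trained sequentially on weak learners.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

ada = AdaBoostClassifier(n_estimators=100)           # re-weights misclassified samples
gbm = GradientBoostingClassifier(n_estimators=100)   # each tree fits the previous errors

print("AdaBoost accuracy         :", cross_val_score(ada, X, y, cv=5).mean())
print("Gradient Boosting accuracy:", cross_val_score(gbm, X, y, cv=5).mean())
```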
Stacking: Stacking uses multiple models of different types and combines their predictions using
a "meta-learner" or "meta-model," which learns how to best combine these predictions.
Stacking can often yield better results by leveraging the strengths of diverse models, though it
can be more complex and computationally intensive.
Example: A stacking ensemble could combine a decision tree, a logistic regression model, and
a support vector machine, with a meta-model learning the optimal combination of their outputs.
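A minimal sketch of that stacking example, assuming scikit-learn; the base learners and the logistic-regression meta-model are illustrative choices.

```python
# Illustrative sketch: stacking a decision tree, logistic regression and SVM
# with a logistic regression meta-learner.
from sklearn.datasets import load_iris
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

base_learners = [
    ("tree", DecisionTreeClassifier()),
    ("logreg", LogisticRegression(max_iter=1000)),
    ("svm", SVC(probability=True)),
]
stack = StackingClassifier(estimators=base_learners,
                           final_estimator=LogisticRegression())  # meta-learner

print("Stacking accuracy:", cross_val_score(stack, X, y, cv=5).mean())
```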
Q.16 Random forest algorithm or Explain Ensemble learning algorithm Random Forest
and its use cases in real-world applications. [VV IMP]
Real-world applications of Random Forest:
Healthcare Diagnostics: Random Forest is frequently used for predicting the likelihood of
diseases like diabetes, heart disease, and cancer by analyzing patient data, symptoms, and
diagnostic indicators.
Weather Forecasting: Random Forest can be used in weather prediction models to forecast
temperatures, rainfall, and other weather conditions. It helps in analyzing meteorological data to
predict local and regional climate trends, aiding agriculture, disaster management, and public
safety.
Finance: Financial institutions use Random Forest for credit scoring, fraud detection, and risk
assessment. The model can help classify loan applicants as high or low risk by analyzing
factors such as credit history, income, and transaction patterns.
Stock Market Analysis: Traders and analysts use Random Forest for stock price prediction by
analyzing historical data, market trends, and external economic indicators.
Mod. 4
Q.17 Explain Multiclass Classification [VV IMP]
Multi-class classification is a type of classification task where a model is designed to classify
input data into one of three or more classes (categories). Unlike binary classification, where
there are only two possible classes, multi-class classification involves multiple distinct
categories, requiring the model to make more complex decisions.
Training Data: The model is trained on labeled data where each instance belongs to one of
several classes.
Prediction: For each new instance, the model predicts the most likely class among the multiple
options.
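A minimal sketch of multi-class classification, assuming scikit-learn; the Iris dataset (three classes) is an illustrative example, and scikit-learn's logistic regression handles the multiple classes directly.

```python
# Illustrative sketch: training and evaluating a classifier on a 3-class problem.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)          # y takes values 0, 1, 2 (three classes)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Predicted classes:", clf.predict(X_test[:5]))   # most likely class per instance
print("Test accuracy    :", clf.score(X_test, y_test))
```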
Q.18 Define Support Vector Machine. Explain how margin is computed and optimal
hyper-plane is decided + Concept of Margin and SVM. [IMP]
Kernel: A kernel is a mathematical function in SVM that helps us handle data that isn’t linearly
separable in its original form. It works by transforming the data into a higher-dimensional space
where it becomes easier to separate using a straight line (or hyperplane).
If we have two groups of points that cannot be separated by a line in a 2D plane, we can lift
those points into 3D space; in this new space it might be possible to draw a plane that
separates them.
Hard Margin: A hard margin refers to the maximum-margin hyperplane that perfectly separates
the data points of different classes without any misclassifications.
Soft Margin: When data contains outliers or is not perfectly separable, SVM uses the soft
margin technique. This method introduces a slack variable for each data point to allow some
misclassifications while balancing between maximizing the margin and minimizing violations.
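A minimal sketch of these ideas, assuming scikit-learn; the concentric-circles dataset is illustrative. In SVC, the C parameter controls margin softness (a large C approximates a hard margin, a small C tolerates more violations), and the kernel handles data that is not linearly separable.

```python
# Illustrative sketch: soft-margin SVMs with a linear vs. an RBF kernel.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_svm = SVC(kernel="linear", C=1.0).fit(X, y)   # struggles on concentric circles
rbf_svm = SVC(kernel="rbf", C=1.0).fit(X, y)         # kernel lifts data so a hyperplane separates it

print("Linear kernel accuracy:", linear_svm.score(X, y))
print("RBF kernel accuracy   :", rbf_svm.score(X, y))
```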
Q.22 Consider the use case of Email spam detection. Identify and explain the suitable
machine learning technique for this task. [Naive Bayes]
For the use case of email spam detection, Naive Bayes is a highly suitable machine learning
technique due to its simplicity, effectiveness, and efficiency in handling text classification
problems like spam filtering.
- Naive Bayes is a probabilistic classifier that uses Bayes' Theorem to calculate the
probability that a given email is spam or not, based on the words (features) it contains.
Given an email, Naive Bayes computes the probability of it being spam based on the
presence of certain words commonly associated with spam (e.g., "free," "winner,"
"discount").
- The "naive" aspect of Naive Bayes assumes that the presence of each word in an email
is independent of the presence of other words. While this assumption might not hold
perfectly, it simplifies the computations and has been shown to perform very well in
practice for text classification tasks.
- Emails have a large vocabulary (number of unique words), making them
high-dimensional data. Naive Bayes handles high dimensionality effectively because it
only needs per-word probability estimates rather than modeling interactions between words.
- Naive Bayes is computationally efficient and requires relatively little training data,
making it fast and easy to train even with large datasets.
Training Phase: A dataset with labeled examples of spam and non-spam emails is used for
training. For each word, Naive Bayes calculates the probability of the word appearing in spam
emails and the probability of it appearing in non-spam emails.
Prediction Phase: For a new email, Naive Bayes calculates the probabilities of the email being
spam or not based on the presence of each word in the email. The classifier then assigns the
label “spam” or “non-spam” based on the higher probability.
Example: Suppose we have a simple spam email dataset with two classes: spam and not
spam. If the email contains words like “winner,” “free,” and “prize” frequently in spam emails,
Naive Bayes will assign a high probability to these words being in spam emails.
So, if a new email contains the words “You are a winner! Claim your free prize,” Naive Bayes will
calculate a high probability that the email is spam based on these words' presence and classify
it as spam.
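A minimal sketch of this spam filter, assuming scikit-learn; the tiny bag-of-words training set below is purely illustrative.

```python
# Illustrative sketch: a bag-of-words Naive Bayes spam classifier.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_emails = [
    "winner free prize claim now",     # spam
    "free discount limited offer",     # spam
    "meeting agenda for tomorrow",     # not spam
    "project report attached",         # not spam
]
train_labels = ["spam", "spam", "not spam", "not spam"]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_emails)       # word-count features
clf = MultinomialNB().fit(X_train, train_labels)       # learns per-word probabilities per class

new_email = ["You are a winner! Claim your free prize"]
print(clf.predict(vectorizer.transform(new_email)))    # expected: ['spam']
```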
Q.23 Write a short note on optimization technique in machine learning [Explain Steepest
Descent method of optimization]
In machine learning, optimization techniques are methods used to minimize or maximize an
objective function, typically a cost or loss function. The goal of these techniques is to find the
optimal parameters for a model and reduce errors which lead to the best predictions.
Gradient Descent method or the Steepest Descent method, is one of the simplest and most
commonly used optimization techniques in machine learning.
It aims to find the minimum of a function by taking steps proportional to the negative of the
gradient (slope) of the function at the current point. This method is useful when dealing with
continuous variables.
Step 1: Initialize: Start with initial values for the model parameters (often zeros or small
random numbers).
Step 2: Compute the Gradient: Calculate the gradient of the loss function with respect to each
parameter at the current point.
Step 3: Update the Parameters: Move each parameter a small step in the direction of the
negative gradient, i.e. parameter = parameter − learning rate × gradient.
Step 4: Repeat: Continue updating parameters iteratively until convergence (when the change
in parameters is very small, or the loss stops decreasing significantly).
Example: Consider minimizing a simple quadratic cost function, like a mean squared error. The
Steepest Descent method would compute the gradient at each point (for example, how the error
changes with respect to each weight in a linear regression) and then adjust the weights in the
direction that reduces the error. Over time, these steps bring the weights to values that minimize
the error, thus optimizing the model.
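A minimal sketch of that example, assuming NumPy; it runs steepest descent on the mean squared error of a one-variable linear model y = w·x + b, with illustrative data and learning rate.

```python
# Illustrative sketch: steepest (gradient) descent minimizing MSE for y = w*x + b.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * x + 1.0                      # true relationship the model should recover

w, b = 0.0, 0.0                        # Step 1: initialize parameters
learning_rate = 0.01

for _ in range(2000):                  # Step 4: repeat until (approximate) convergence
    error = (w * x + b) - y
    grad_w = 2 * np.mean(error * x)    # Step 2: gradient of MSE w.r.t. w
    grad_b = 2 * np.mean(error)        #         gradient of MSE w.r.t. b
    w -= learning_rate * grad_w        # Step 3: step against the gradient
    b -= learning_rate * grad_b

print("Learned w, b:", w, b)           # should approach 2 and 1
```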
Example (RBF kernel, continuing the kernel discussion from Q.18): Imagine we have two classes in a dataset that form concentric circles. A linear
boundary in the original 2D space would not separate these classes, but using an RBF kernel
transforms the data into a higher dimension where a hyperplane (linear boundary) can
effectively separate the classes.
Mod. 5
Q.25 Expectation Maximization Algorithm Short note (EM Algo) [VV IMP]
Q.26 DBSCAN or What is Density-based clustering ? Explain the steps used for
clustering using the DBSCAN algorithm. [VV IMP]
Euclidean Distance: Euclidean distance is the "straight-line" distance between two points in
Euclidean space. Common in k-means clustering and scenarios where the data has a
continuous, multidimensional structure.
Manhattan Distance: The Manhattan distance between two points is the sum of the absolute
differences of their coordinates. Useful for sparse data and cases where the directions of
changes are more important than exact distances. Often used in city-block grids or
high-dimensional spaces.
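A minimal sketch comparing the two distances, assuming NumPy; the two points are illustrative.

```python
# Illustrative sketch: Euclidean vs. Manhattan distance between two points.
import numpy as np

p = np.array([1.0, 2.0, 3.0])
q = np.array([4.0, 6.0, 3.0])

euclidean = np.sqrt(np.sum((p - q) ** 2))   # straight-line distance = 5.0
manhattan = np.sum(np.abs(p - q))           # sum of absolute differences = 7.0

print("Euclidean:", euclidean)
print("Manhattan:", manhattan)
```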
Mod. 6
Q.29 What is Dimensionality reduction ? Explain how it can be utilized for classification
and clustering tasks in Machine Learning.
Dimensionality reduction is the process of reducing the number of features (or dimensions) in
a dataset, while retaining as much of the original information as possible. This is essential when
working with high-dimensional data, as too many features can make models complex, increase
computation time, and lead to issues like overfitting. Dimensionality reduction helps simplify
models, improve performance, and make data more manageable and interpretable.
Example: Imagine a dataset with thousands of genes as features in a medical study. PCA can
be applied to reduce these thousands of features to a smaller set of components that capture
the core variance. This allows for effective visualization, easier clustering of patient data, and
more efficient modeling.
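A minimal sketch of that idea, assuming scikit-learn; a random matrix stands in for the gene-expression data, with 100 samples and 1000 features reduced to 2 principal components.

```python
# Illustrative sketch: PCA projecting high-dimensional data onto a few components.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1000))        # 100 samples, 1000 features ("genes")

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)        # project onto the top 2 principal components

print("Reduced shape:", X_reduced.shape)                   # (100, 2)
print("Variance explained:", pca.explained_variance_ratio_)
```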
Q.31 Linear Discriminant Analysis for Dimension Reduction Theory / [Numerical] [VV
IMP]
OR Explain the Dimensionality Reduction technique of LDA and its real world
applications.
Linear Discriminant Analysis (LDA) is a powerful technique that can be used for both
dimensionality reduction and classification. While it is mainly known for classification, it can
also reduce the dimensions of data, just like Principal Component Analysis (PCA), but with a
key difference: LDA considers class labels in the process.
Steps of LDA:
Step 1: Compute the Mean of each class: For each class in the dataset, calculate the mean of
the features.
Step 2: Compute the Between-Class Scatter Matrix: Measures the spread of the class
means from the overall mean of the dataset. It tries to capture how different the class means are
from each other. Variance BETWEEN all the classes.
Step 3: Compute the Within-Class Scatter Matrix: Measures how much each class's data
points deviate from the mean of that class. Variance WITHIN a single class.
Step 4: Compute the Optimal Projection (Eigenvectors): By solving the eigenvalue problem
for the matrix formed by the inverse of the within-class scatter matrix multiplied by the
between-class scatter matrix, LDA computes discriminant directions that maximize class
separability; the eigenvectors represent the axes along which the classes are most
distinguishable.
Step 5: Transform the Data: Choose the top k eigenvectors corresponding to the largest
eigenvalues to project the data into a lower-dimensional space.
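A minimal sketch of LDA as a supervised dimensionality-reduction step, assuming scikit-learn; the 4-feature, 3-class Iris dataset is illustrative, and at most (number of classes − 1) = 2 components can be kept.

```python
# Illustrative sketch: LDA projecting labeled data onto its discriminant directions.
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)          # uses the class labels, unlike PCA

print("Reduced shape:", X_lda.shape)     # (150, 2)
print("Explained variance ratio:", lda.explained_variance_ratio_)
```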
Q.32 SVD numerical