
MACHINE LEARNING

Machine Learning is a field of computer science that enables computers to learn from data without being explicitly programmed. It involves the development of algorithms that can learn from data and make decisions or predictions based on data.
Types of Learning

Data Type:
• Supervised Learning: Labeled
• Unsupervised Learning: Unlabeled

Learning Goal:
• Supervised Learning: Make predictions for new data
• Unsupervised Learning: Discover hidden patterns and structures within data

Analogy:
• Supervised Learning: Learning with a teacher's guidance
• Unsupervised Learning: Exploring and making sense of the world on your own

Model Training:
• Supervised Learning: The model is trained on a dataset where each data point has a corresponding label (desired output). The model learns the relationship between the input features and the labels.
• Unsupervised Learning: The model is trained on a dataset where data points have no predefined labels. The model identifies similarities and differences between data points to group them or uncover underlying structures.
Evaluation:
• Supervised Learning: The model's performance is evaluated on unseen data to assess its ability to generalize and make accurate predictions.
• Unsupervised Learning: The effectiveness of the model is evaluated based on how well it achieves the desired outcome, such as identifying distinct clusters or meaningful patterns.

Common Algorithms:
• Supervised Learning: Linear Regression, Logistic Regression, Decision Trees, Support Vector Machines (SVM)
• Unsupervised Learning: K-Means Clustering, Hierarchical Clustering, Principal Component Analysis (PCA), Apriori algorithm

Example:
• Supervised Learning: Spam filter: classifies emails as spam or not spam based on labeled training data. Stock price prediction: predicts future stock prices based on historical labeled data.
• Unsupervised Learning: Customer segmentation: groups customers with similar characteristics based on unlabeled customer data. Anomaly detection: identifies unusual patterns in data, potentially indicating fraud or system failures.
SUPERVISED LEARNING

Linear Regression
Linear Regression models the relationship between a
dependent variable and one or more independent
variables by fitting a linear equation to observed data. It
is used to predict a continuous outcome based on input
features.

• Assumptions on Data:
• Linearity: The relationship between the
independent and dependent variables is linear.
• Independence: Observations are independent of
each other.
• Homoscedasticity: The variance of the errors is
constant across all levels of the independent
variables.
• Normality: The errors are normally distributed.
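
As a minimal sketch (assuming scikit-learn and a small synthetic dataset, not part of the original slides), fitting and evaluating an ordinary least-squares model might look like this:

```python
# Minimal linear regression sketch (scikit-learn assumed; data is synthetic).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                                         # three input features
y = 4.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)  # linear target + noise

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_train, y_train)

print("coefficients:", model.coef_)                                   # estimated slopes
print("R^2 on test data:", r2_score(y_test, model.predict(X_test)))
```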
Definition:
• Ridge Regression: Linear regression with a regularization term that penalizes the L2 norm of the coefficients.
• Lasso Regression: Linear regression with a regularization term that penalizes the L1 norm of the coefficients.
• Elastic Net Regression: Linear regression with a combination of L1 and L2 regularization terms.

Objective Function:
• Ridge Regression: $\min_\beta \|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2$
• Lasso Regression: $\min_\beta \|y - X\beta\|_2^2 + \lambda \|\beta\|_1$
• Elastic Net Regression: $\min_\beta \|y - X\beta\|_2^2 + \lambda_1 \|\beta\|_1 + \lambda_2 \|\beta\|_2^2$

Shrinkage Effect:
• Ridge Regression: Shrinks coefficients towards zero, but does not set them exactly to zero.
• Lasso Regression: Induces sparsity by setting some coefficients to exactly zero.
• Elastic Net Regression: Combines benefits of Ridge and Lasso, allowing for both shrinkage and sparsity.

Feature Selection:
• Ridge Regression: Less effective for feature selection, as it does not force coefficients to be exactly zero.
• Lasso Regression: Effective for feature selection, as it can eliminate irrelevant features.
• Elastic Net Regression: Balances between Ridge and Lasso in terms of feature selection.

Suitable For:
• Ridge Regression: When all features are potentially relevant and multicollinearity is present.
• Lasso Regression: When there are many irrelevant features or when a sparse solution is desired.
• Elastic Net Regression: When there are many features and multicollinearity is present, but also when feature selection is desired.
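
A small illustrative comparison of the shrinkage behaviour described above (scikit-learn assumed; the alpha values and the synthetic data are arbitrary choices, not from the slides):

```python
# Compare how many coefficients each regularized model keeps non-zero.
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 10))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=100)  # only 2 informative features

for name, est in [("Ridge", Ridge(alpha=1.0)),
                  ("Lasso", Lasso(alpha=0.1)),
                  ("ElasticNet", ElasticNet(alpha=0.1, l1_ratio=0.5))]:
    est.fit(X, y)
    n_zero = int(np.sum(est.coef_ == 0))          # Lasso / Elastic Net can zero out coefficients
    print(f"{name}: non-zero coefficients = {10 - n_zero}")
```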
Logistic Regression

Logistic regression is used for binary classification problems. It models the probability of an event occurring by fitting data to a logistic function (sigmoid function).

• Assumptions on Data:
• Linearity: The log odds of the outcome is a linear combination of the predictor variables.
• Independence: Observations are independent of each other.
• Large Sample Size: Logistic regression requires a large sample size to provide a good estimate of the model parameters.
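
A minimal binary-classification sketch (scikit-learn assumed; the synthetic dataset is illustrative):

```python
# Logistic regression: predicted probabilities come from the sigmoid of a linear score.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)
print("predicted probabilities:", clf.predict_proba(X_test[:3]))  # per-class sigmoid outputs
print("accuracy:", clf.score(X_test, y_test))
```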
Naive Bayes

Naive Bayes is a probabilistic machine learning algorithm based on Bayes' theorem with the assumption of independence between features. It is commonly used for classification tasks, especially in text classification and spam filtering.

Working:
• Naive Bayes calculates the probability of a data point belonging to each class based on the feature values.
• It assumes that the features are conditionally independent given the class label.
• To classify a new data point, Naive Bayes selects the class with the highest posterior probability using Bayes' theorem.

Key Formulas:
• Bayes' Theorem: $P(C_k \mid X) = \frac{P(X \mid C_k)\, P(C_k)}{P(X)}$, where $C_k$ is the class label, $X$ is the input features, $P(C_k \mid X)$ is the posterior probability of class $C_k$ given features $X$, $P(X \mid C_k)$ is the likelihood of the features given class $C_k$, $P(C_k)$ is the prior probability of class $C_k$, and $P(X)$ is the probability of the features $X$.
• Independence Assumption: $P(X \mid C_k) = \prod_{i=1}^{n} P(X_i \mid C_k)$, assuming features $X_1, X_2, \ldots, X_n$ are conditionally independent given class $C_k$.
Types:
• Gaussian Naive Bayes: Assumes that continuous features follow a Gaussian distribution.
• Multinomial Naive Bayes: Suitable for discrete features, commonly used in text classification with word counts or TF-
IDF values.
• Bernoulli Naive Bayes: Similar to Multinomial Naive Bayes but assumes binary features (e.g., presence or absence of
words).
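
A minimal sketch of the Gaussian variant on synthetic continuous features (scikit-learn assumed):

```python
# Gaussian Naive Bayes: learns a per-class mean/variance for each feature.
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

nb = GaussianNB().fit(X_train, y_train)
print("posterior P(C_k | x) for one sample:", nb.predict_proba(X_test[:1]))
print("accuracy:", nb.score(X_test, y_test))
```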
Support Vector Machine

Support Vector Machine (SVM) is a supervised machine learning algorithm used for classification and regression tasks. It finds a hyperplane in an N-dimensional space (N is the number of features) that best separates the data points into classes.

Working:
• SVM aims to find the hyperplane that maximizes the margin between the classes.
• The margin is the distance between the hyperplane and the nearest data point from each class; these nearest points are known as support vectors.
• Support vectors are the critical data points that determine the position and orientation of the hyperplane.
• SVM seeks to maximize this margin, making it robust to outliers and generalizable to unseen data.

Kernel Trick:
• SVM can be extended to non-linearly separable data using a kernel function, such as a polynomial, radial basis function (RBF), or sigmoid kernel.
• Kernel functions transform the input space into a higher-dimensional space where the data becomes linearly separable.
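
A minimal sketch of an RBF-kernel SVM on data that is not linearly separable (scikit-learn assumed; the make_moons dataset and C/gamma values are illustrative):

```python
# SVM with an RBF kernel: the kernel trick handles the non-linear class boundary.
from sklearn.datasets import make_moons
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_train, y_train)
print("support vectors per class:", svm.n_support_)
print("test accuracy:", svm.score(X_test, y_test))
```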
K-Nearest Neighbors

K-Nearest Neighbors (KNN) is a simple, non-parametric supervised learning algorithm used for classification and regression tasks. It classifies a data point based on the majority class of its k nearest neighbors in the feature space.

Working:
• Given a new data point, KNN calculates the distance (e.g., Euclidean distance) to all other data points in the training set.
• It selects the k nearest neighbors based on the calculated distances.
• For classification, KNN assigns the class label that is most common among the k neighbors.
• For regression, KNN predicts the average value of the target variable among the k neighbors.

Key Formulas:
• Euclidean Distance: $d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$
• Majority Voting: For classification, the class of a data point is determined by the majority class among its k nearest neighbors.
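
A minimal classification sketch with k = 5 and Euclidean distance (scikit-learn defaults; the iris dataset is an illustrative choice):

```python
# KNN: majority vote among the 5 nearest neighbours of each test point.
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print("test accuracy:", knn.score(X_test, y_test))
```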
Decision Tree

Decision Tree is a supervised machine learning algorithm used for both classification and regression tasks. It partitions the feature space into regions, assigning a label or value to each region based on the majority class or average target value of the training instances within that region.

Working:
• Decision Tree recursively splits the feature space into subsets, based on the values of features, in a hierarchical manner.
• At each node of the tree, it selects the feature that best splits the data into homogeneous subsets, using a criterion such as Gini impurity, entropy, or information gain.
• The splitting process continues until a stopping criterion is met, such as reaching a maximum tree depth, a minimum number of samples in a node, or no further improvement in impurity reduction.

Key Formulas:
• Gini Impurity: $G = 1 - \sum_{i} p_i^2$, where $p_i$ is the proportion of samples in class $i$ at a particular node.
• Entropy: $H = -\sum_{i} p_i \log_2 p_i$, where $p_i$ is the proportion of samples in class $i$ at a particular node.
• Information Gain: $IG(D, A) = I(D) - \sum_{v} \frac{|D_v|}{|D|}\, I(D_v)$, where $D$ is the dataset, $A$ is a feature, $D_v$ is the subset of $D$ where feature $A$ takes value $v$, and $I$ is the impurity measure.
Decision Tree

Splitting Criteria:
• Gini Impurity: Measures the probability of a randomly chosen sample being incorrectly classified.
• Entropy: Measures the average amount of information needed to classify a sample.
• Information Gain: Measures the reduction in impurity achieved by splitting the data on a particular feature.
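
A minimal sketch (scikit-learn assumed; the gini helper below is an illustrative hand-written version of the impurity formula, not part of the library):

```python
# Compute Gini impurity for a node's labels, then fit a shallow tree with the Gini criterion.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

def gini(labels):
    """Gini impurity G = 1 - sum_i p_i^2 over the class proportions at a node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

X, y = load_iris(return_X_y=True)
print("Gini impurity of the root node:", gini(y))

tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0).fit(X, y)
print("depth:", tree.get_depth(), "| leaves:", tree.get_n_leaves())
```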
ENSEMBLE LEARNING
Bagging

Bootstrapping refers to a sampling method that involves drawing random samples from a dataset with replacement.

Bootstrap aggregating, more commonly known as bagging, is a technique that directly leverages the concept of bootstrapping to create and train a weak learner on each of the individual subsets (sometimes referred to as "bags"). A weak learner in this context is defined as an algorithm that performs just slightly better than random guessing.

Bagging is a parallel process, meaning that the models are created in parallel on these subsets and are independent of one another.

After fitting a model to each of the bootstrapped subsets, their respective results are combined, or aggregated, in order to obtain the final results.
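
A minimal bagging sketch (scikit-learn assumed; the estimator keyword is used by recent versions, it was base_estimator in older releases; the shallow tree stands in for the "weak learner"):

```python
# Bagging: many weak learners trained on bootstrap samples, predictions aggregated.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)
bag = BaggingClassifier(
    estimator=DecisionTreeClassifier(max_depth=2),  # deliberately weak base learner
    n_estimators=50,
    bootstrap=True,                                 # sample each "bag" with replacement
    random_state=0,
)
print("mean CV accuracy:", cross_val_score(bag, X, y, cv=5).mean())
```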
Boosting

Compared to bagging, boosting generally does not make use of bootstrapping and follows a sequential process, where each subsequent model tries to correct the errors of the previous model and reduce its bias.

First, a weak learner is fitted on the original training data and subsequently evaluated by comparing its predictions to the actual values. During this initial iteration, all samples are given equal weights.

Next, we increase the weights of the misclassified samples and decrease the weights of the correctly classified ones. The samples that were misclassified will thus have higher weights in the next iteration, thereby "boosting" their importance.

This process is then repeated for a pre-defined number of iterations, or until the models' predictions reach a desired level of accuracy. Once all the models are trained, their predictions are combined to produce the final output. Typically, the prediction of each individual model is weighted based on its accuracy, with more accurate models contributing more to the final prediction.
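
A minimal sketch using AdaBoost as one concrete boosting algorithm (scikit-learn assumed; the estimator keyword is the recent spelling of base_estimator; decision stumps play the role of the weak learner):

```python
# AdaBoost: stumps fitted sequentially, with misclassified samples re-weighted each round.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)
boost = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # a "stump", slightly better than guessing
    n_estimators=100,
    learning_rate=0.5,
    random_state=0,
)
print("mean CV accuracy:", cross_val_score(boost, X, y, cv=5).mean())
```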
Random Forest

Random Forest is an ensemble learning method used for both classification and regression tasks. It operates by constructing a multitude of decision trees during training and outputs the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees.

Working:
• Random Forest builds multiple decision trees using a technique called bootstrap aggregation (bagging).
• Each tree is trained on a random subset of the training data (bootstrap sample) and a random subset of features at each split.
• During prediction, each tree in the forest independently predicts the class label (or regression value) of a new data point.
• The final prediction is determined by averaging (in regression) or taking a majority vote (in classification) over the predictions of all trees.

Key Formulas:
• Random Forest does not have specific key formulas, but it utilizes the same formulas as Decision Trees for splitting criteria, such as Gini impurity, entropy, or information gain.

Hyperparameters:
• Number of Trees: The number of decision trees in the forest.
• Max Depth: The maximum depth allowed for each decision tree.
• Minimum Samples Split: The minimum number of samples required to split a node.
• Minimum Samples Leaf: The minimum number of samples required to be at a leaf node.
• Max Features: The number of features to consider when looking for the best split.
• Bootstrap Sampling: Whether to use bootstrap sampling (with replacement) when building trees.
Random Forest

Splitting Criteria:
• Gini Impurity: Measures the probability of a randomly chosen sample being incorrectly classified.
• Entropy: Measures the average amount of information needed to classify a sample.
• Information Gain: Measures the reduction in impurity achieved by splitting the data on a particular feature.

Ensemble Technique:
• Random Forest is an ensemble technique that combines the predictions of multiple weak learners (decision trees) to improve overall performance and robustness.
• It reduces overfitting and variance by averaging or voting over multiple independent models.
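
A minimal sketch wiring up the hyperparameters listed above (scikit-learn assumed; the specific values are illustrative, not recommendations):

```python
# Random forest: bagged decision trees with a random feature subset at each split.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(
    n_estimators=200,       # number of trees
    max_depth=None,         # grow each tree until leaves are pure
    min_samples_split=2,
    min_samples_leaf=1,
    max_features="sqrt",    # features considered at each split
    bootstrap=True,         # bootstrap sampling per tree
    random_state=0,
).fit(X_train, y_train)
print("test accuracy:", rf.score(X_test, y_test))
```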
UNSUPERVISED LEARNING
K-Means

K-Means clustering is an unsupervised machine learning algorithm used for partitioning a dataset into k distinct, non-overlapping clusters. It aims to minimize the within-cluster variance, or inertia, by iteratively assigning data points to the nearest cluster centroid and updating the centroids.

Working:
• Initialize k centroids randomly or based on some heuristic.
• Assign each data point to the nearest centroid, forming k clusters.
• Update the centroids by computing the mean of all data points assigned to each cluster.
• Repeat the assignment and update steps until convergence, i.e., when the centroids no longer change significantly or a maximum number of iterations is reached.

Key Formulas:
• Distance Metric: The Euclidean distance is commonly used, but other metrics such as Manhattan distance or cosine similarity can also be used.
• Within-Cluster Variance: Inertia is often used as a measure of clustering quality, defined as the sum of squared distances of samples to their closest cluster center: $\sum_{j=1}^{k} \sum_{x \in C_j} \| x - \mu_j \|^2$.
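
A minimal sketch (scikit-learn assumed; the blob data and k = 4 are illustrative):

```python
# K-Means: fit centroids on synthetic blobs and report the resulting inertia.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
print("cluster centroids:\n", km.cluster_centers_)
print("inertia (within-cluster sum of squares):", km.inertia_)
```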
K-Means

The elbow plot is a graphical tool used to determine the optimal number of clusters (k) in the K-Means clustering algorithm. It helps identify the point where increasing the number of clusters yields diminishing returns, meaning the improvement in clustering performance slows down, forming an "elbow" shape in the plot.

Working:
• Perform K-Means clustering on the dataset for a range of k values (typically from 1 to a chosen upper limit).
• For each k value, compute the within-cluster variance or inertia (sum of squared distances from each point to its assigned centroid).
• Plot the value of inertia (on the y-axis) against the number of clusters k (on the x-axis).
• The "elbow" point on the curve indicates the optimal k value. After this point, adding more clusters provides only a minimal reduction in inertia.

Key Terms:
• Inertia: The sum of squared distances of each data point to its nearest centroid. It indicates how tightly the clusters are packed.
• Diminishing Returns: The concept that, after a certain point, adding more clusters does not significantly improve the clustering outcome.
• Elbow Point: The k value where inertia starts to decrease at a slower rate, forming a sharp bend in the plot.
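
A minimal elbow-plot sketch (scikit-learn and matplotlib assumed; data and the k range 1-10 are illustrative):

```python
# Elbow plot: run K-Means for several k and look for the bend in the inertia curve.
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
ks = range(1, 11)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(ks, inertias, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("inertia")
plt.title("Elbow plot")
plt.show()
```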
Hierarchical Clustering

Hierarchical Clustering is an unsupervised clustering algorithm that builds a hierarchy of clusters. It does not require the number of clusters k to be specified in advance. It can produce either a dendrogram (tree-like structure) or a set of nested clusters.

Types:
• Agglomerative Hierarchical Clustering: Starts with individual data points as clusters and iteratively merges the closest pairs of clusters until the desired number of clusters is reached.
• Divisive Hierarchical Clustering: Starts with a single cluster containing all data points and recursively splits the cluster into smaller clusters until each cluster contains only one data point.

Working:
• Start with each data point as its own cluster, treating N data points as N clusters.
• Merge the two closest clusters into a single cluster based on a distance metric (e.g., Euclidean distance).
• Repeat the merging process until only a single cluster remains or until a stopping criterion is met.

Hyperparameters:
• Distance Metric: The choice of distance metric can significantly affect the clustering results.
• Linkage Method: The method used to calculate the distance between clusters can impact the resulting cluster structure.
• Stopping Criterion: Criteria such as the maximum number of clusters or a threshold distance can be used to stop the clustering process.
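
A minimal agglomerative (bottom-up) sketch (scikit-learn assumed; Ward linkage and n_clusters = 3 are illustrative choices):

```python
# Agglomerative clustering: repeatedly merge the closest clusters under Ward linkage.
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)
agg = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels = agg.fit_predict(X)
print("cluster sizes:", [int((labels == c).sum()) for c in range(3)])
```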
DBSCAN

DBSCAN is a density-based clustering algorithm used to partition a dataset into clusters of varying shapes and sizes. It does not require the number of clusters k to be specified beforehand. DBSCAN is capable of identifying noise points, which do not belong to any cluster.

Working:
• DBSCAN defines clusters as dense regions of data points separated by regions of lower density.
• It requires two parameters: ε, the maximum distance between two points to be considered neighbors, and minPts,
the minimum number of points required to form a dense region (core point).
• The algorithm starts by randomly selecting a point from the dataset. If it has at least minPts neighbors within
distance ε, it becomes a core point, and a new cluster is formed.
• The algorithm expands the cluster by adding all reachable points (including core points and border points) within
distance ε to the cluster.
• If a core point is not reachable from any existing cluster, it becomes a new cluster.
• Points that are not core points and are not reachable from any cluster are considered noise points.
DBSCAN

Key Formulas:
• Reachability Distance: The reachability distance between two data points p and q is defined as the maximum of the core distance of q and the distance between p and q: reachability-distance(p, q) = max(core-distance(q), dist(p, q)), where the core distance of q is the distance to its minPts-th nearest neighbor.
• Core Distance: The core distance of a data point p is the distance to its minPts-th nearest neighbor, represented as core-distance(p).
• Border Points: Border points are data points that are not core points themselves but are within the ε-neighborhood of a core point.
Principal Component Analysis

Principal Component Analysis (PCA) is a dimensionality reduction technique that simplifies complex datasets by transforming the original variables into a smaller set of new variables called principal components. These components capture the most important patterns, or variance, in the data while reducing the number of features.

Working:
• Standardize the data: Ensure that each feature (like age, income, etc.) is on the same scale to avoid larger values dominating the analysis.
• Compute the covariance matrix: Calculate the relationships between features to see how much they vary together.
• Find principal components: Use mathematical methods to find the directions (called principal components) along which the data shows the most variation.
• Rank the components: The components are ranked based on how much of the total variance they capture, using eigenvalues.
• Reduce dimensions: Select the top k components that capture the most variance, and project the original data onto these components.

Key Terms:
• Principal Components: New variables created by combining the original ones, capturing the main patterns in the data.
• Covariance Matrix: A table showing how much each pair of features varies together.
• Eigenvalues: Values that indicate how much information (variance) each principal component contains.
• Eigenvectors: The directions or axes along which the principal components lie.
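
A minimal sketch following the steps above (scikit-learn assumed; the iris dataset and k = 2 components are illustrative):

```python
# PCA: standardize the features, then keep the top 2 principal components.
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)        # put every feature on the same scale
pca = PCA(n_components=2).fit(X_scaled)
print("explained variance ratio:", pca.explained_variance_ratio_)  # derived from the eigenvalues
X_reduced = pca.transform(X_scaled)                  # project data onto the top 2 components
print("reduced shape:", X_reduced.shape)
```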
EVALUATION METRICS
Regression Metrics

Basic metrics: Given a regression model f, the following metrics are commonly used to assess the performance of the model:

• Coefficient of determination: The coefficient of determination, often noted $R^2$ or $r^2$, provides a measure of how well the observed outcomes are replicated by the model and is defined as follows: $R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$.
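
A minimal sketch of computing $R^2$ for a set of predictions (scikit-learn assumed; the values are made up for illustration):

```python
# Coefficient of determination: R^2 = 1 - SS_res / SS_tot.
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])
print("R^2:", r2_score(y_true, y_pred))
```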
Classification Metrics

Confusion matrix: The confusion matrix is used to have a more complete picture when assessing the performance of a model.

The following metrics are commonly used to assess the performance of classification models:
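
A minimal sketch of the metrics typically derived from the confusion matrix (accuracy, precision, recall, and F1 score are assumed here as the usual set; scikit-learn assumed, labels are made up):

```python
# Common classification metrics computed from true vs. predicted labels.
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score

y_true = [0, 1, 1, 0, 1, 1, 0, 0]
y_pred = [0, 1, 0, 0, 1, 1, 1, 0]
print("confusion matrix:\n", confusion_matrix(y_true, y_pred))
print("accuracy:", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall:", recall_score(y_true, y_pred))
print("F1:", f1_score(y_true, y_pred))
```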
Classification Metrics

ROC: The receiver operating curve, also noted ROC, is the plot of TPR (true positive rate) versus FPR (false positive rate) obtained by varying the decision threshold.

AUC: The area under the receiver operating curve, also noted AUC or AUROC, is the area below the ROC curve.
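
A minimal sketch (scikit-learn assumed; logistic regression on synthetic data is only an illustrative probability model):

```python
# ROC and AUC: sweep the decision threshold over predicted probabilities.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, roc_auc_score

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
proba = LogisticRegression().fit(X_train, y_train).predict_proba(X_test)[:, 1]

fpr, tpr, thresholds = roc_curve(y_test, proba)  # TPR vs FPR at each threshold
print("AUC:", roc_auc_score(y_test, proba))
```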
Model Selection

When selecting a model, we distinguish three different parts of the data, as follows:

Training Set:
• The model is trained on this set.
• Usually 80% of the dataset.

Validation Set:
• The model is assessed on this set.
• Usually 20% of the dataset.
• Also called the hold-out or development set.

Testing Set:
• The model gives predictions on this set.
• Unseen data.

Once the model has been chosen, it is trained on the entire dataset and tested on the unseen test set.
Cross-validation, also noted CV, is a method used to select a model that does not rely too much on the initial training set. The main types are summed up below:

K-fold:
• Training on K-1 folds and assessment on the remaining one.
• Generally K = 5 or 10.

Leave-p-out:
• Training on n-p observations and assessment on the p remaining ones.
• The case p = 1 is called leave-one-out.

The most commonly used method is K-fold cross-validation, which splits the training data into K folds, validating the model on one fold while training it on the K-1 other folds, and repeating this K times. The error is then averaged over the K folds and is called the cross-validation error.
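
A minimal K-fold sketch with K = 5 (scikit-learn assumed; the model and dataset are illustrative):

```python
# 5-fold cross-validation: average the validation score over the folds.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("per-fold accuracy:", scores)
print("cross-validation estimate:", scores.mean())
```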
Bias-Variance Tradeoff

• Bias: The bias of a model is the difference between the expected prediction and the correct model that we try to predict for given data points.
• Variance: The variance of a model is the variability of the model prediction for given data points.
• Bias/variance tradeoff: The simpler the model, the higher the bias; the more complex the model, the higher the variance.
Thank you!
