MACHINE LEARNING
Machine Learning is a field of computer science that enables computers to learn from data without being explicitly programmed. It involves the development of algorithms that can learn from data and make decisions or predictions based on it.
Types of Learning

Features | Supervised Learning | Unsupervised Learning
Linear Regression

Assumptions on Data:
• Linearity: The relationship between the independent and dependent variables is linear.
• Independence: Observations are independent of each other.
• Homoscedasticity: The variance of the errors is constant across all levels of the independent variables.
• Normality: The errors are normally distributed.
Aspects | Ridge Regression | Lasso Regression | Elastic Net Regression
Objective Function | $\min_\beta \|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2$ | $\min_\beta \|y - X\beta\|_2^2 + \lambda \|\beta\|_1$ | $\min_\beta \|y - X\beta\|_2^2 + \lambda_1 \|\beta\|_1 + \lambda_2 \|\beta\|_2^2$
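The three objectives differ only in the penalty term added to the squared error. A minimal scikit-learn sketch; the synthetic data and alpha values are illustrative assumptions, not from the source:

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, Ridge

# Made-up data with sparse true coefficients (features 2, 4, 5 are irrelevant)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
true_coef = np.array([3.0, 0.0, -2.0, 0.0, 0.0])
y = X @ true_coef + rng.normal(scale=0.5, size=100)

# alpha sets the strength of the penalty term in each objective function
models = {
    "Ridge (L2)": Ridge(alpha=1.0),
    "Lasso (L1)": Lasso(alpha=0.1),
    "ElasticNet (L1+L2)": ElasticNet(alpha=0.1, l1_ratio=0.5),
}
for name, model in models.items():
    model.fit(X, y)
    print(name, np.round(model.coef_, 2))  # Lasso tends to zero out weak features
```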
Logistic Regression

Assumptions on Data:
• Linearity: The log odds of the outcome is a linear combination of the predictor variables.
• Independence: Observations are independent of each other.
• Large Sample Size: Logistic regression requires a large sample size to provide a good estimate of the model parameters.
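A minimal fitting sketch, assuming scikit-learn and synthetic data made up for illustration (the labels follow a linear rule in log-odds space):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Illustrative synthetic binary outcome driven by a linear combination of predictors
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] + rng.normal(scale=0.3, size=200) > 0).astype(int)

clf = LogisticRegression().fit(X, y)
print("coefficients:", clf.coef_, "intercept:", clf.intercept_)
print("class probabilities for first rows:", clf.predict_proba(X[:3]))
```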
Naive Bayes

Naive Bayes is a probabilistic machine learning algorithm based on Bayes' theorem with the assumption of independence between features. It's commonly used for classification tasks, especially in text classification and spam filtering.

Key Formulas:
• Bayes' Theorem: $P(A \mid B) = \dfrac{P(B \mid A)\, P(A)}{P(B)}$
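A small text-classification sketch in the spirit of the spam-filtering use case above; the corpus, labels, and test phrase are made-up assumptions, and MultinomialNB is one common Naive Bayes variant for word counts:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny made-up spam-filtering corpus (1 = spam, 0 = ham)
docs = ["win money now", "limited offer win prize",
        "meeting at noon", "project update attached"]
labels = [1, 1, 0, 0]

vec = CountVectorizer()
X = vec.fit_transform(docs)            # word counts as features
# MultinomialNB applies Bayes' theorem assuming word counts are
# conditionally independent given the class (the "naive" assumption)
clf = MultinomialNB().fit(X, labels)
print(clf.predict(vec.transform(["win a big prize"])))  # 'win'/'prize' occur only in spam
```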
Support Vector Machines (SVM)

Working:
• SVM aims to find the hyperplane that maximizes the margin between the classes.
• The margin is the distance between the hyperplane and the nearest data points from each class, known as support vectors.
• Support vectors are the critical data points that determine the position and orientation of the hyperplane.
• SVM seeks to maximize this margin, making it robust to outliers and generalizable to unseen data.

Kernel Trick:
• SVM can be extended to non-linearly separable data using a kernel function, such as a polynomial, radial basis function (RBF), or sigmoid kernel.
• Kernel functions transform the input space into a higher-dimensional space where the data becomes linearly separable.
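A brief scikit-learn sketch of the kernel trick; make_circles produces data that is not linearly separable, and the gamma value is an illustrative assumption:

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Concentric circles: not linearly separable in the original 2-D space
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm = SVC(kernel="rbf", gamma=1.0).fit(X, y)   # kernel trick: implicit mapping

print("linear accuracy:", linear_svm.score(X, y))  # poor: no separating hyperplane
print("RBF accuracy:", rbf_svm.score(X, y))        # near 1.0 after the kernel mapping
print("support vectors per class:", rbf_svm.n_support_)
```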
K-Nearest Neighbors

K-Nearest Neighbors (KNN) is a simple, non-parametric supervised learning algorithm used for classification and regression tasks. It classifies a data point based on the majority class of its k nearest neighbors in the feature space.
Working:
• Given a new data point, KNN calculates the distance (e.g., Euclidean distance) to all other data points in the training set.
• It selects the k nearest neighbors based on the calculated distances.
• For classification, KNN assigns the class label that is most common among the k neighbors.
• For regression, KNN predicts the average value of the target variable among the k neighbors.

Key Formulas:
• Euclidean Distance: $d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$
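A from-scratch sketch of the classification case, directly using the Euclidean distance formula above; the toy data is made up for illustration:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training point
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]               # indices of the k closest points
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Made-up toy data: two well-separated groups
X_train = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([2, 2])))  # -> 0
print(knn_predict(X_train, y_train, np.array([8, 7])))  # -> 1
```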
Random Forest

Ensemble Technique:
• Random Forest is an ensemble technique that combines the predictions of multiple weak learners (decision trees) to improve overall performance and robustness.
• It reduces overfitting and variance by averaging or voting over multiple independent models.
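A minimal scikit-learn sketch of the voting ensemble described above, on an assumed synthetic dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Illustrative synthetic dataset
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Each tree sees a bootstrap sample and random feature subsets;
# the forest combines their votes, reducing variance and overfitting
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print("test accuracy:", forest.score(X_te, y_te))
```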
UNSUPERVISED LEARNING

K-Means
K-means clustering is an unsupervised
machine learning algorithm used for partitioning
a dataset into k distinct, non-overlapping
clusters. It aims to minimize the within-cluster
variance, or inertia, by iteratively assigning data
points to the nearest cluster centroid and
updating the centroids.
Working:
• Initialize k centroids randomly or based on some heuristic.
• Assign each data point to the nearest centroid, forming k clusters.
• Update the centroids by computing the mean of all data points assigned to each cluster.
• Repeat the assignment and update steps until convergence, i.e., when the centroids no longer change significantly or a maximum number of iterations is reached.

Key Formulas:
• Distance Metric: Commonly the Euclidean distance, $d(x, \mu_j) = \|x - \mu_j\|_2$, but other metrics like Manhattan distance or cosine similarity can also be used.
• Within-Cluster Variance: Inertia is often used as a measure of clustering quality, defined as the sum of squared distances of samples to their closest cluster center: $\sum_{i=1}^{n} \min_{\mu_j} \|x_i - \mu_j\|^2$.
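A plain NumPy sketch of the assignment/update loop described above (random initialization; empty clusters are not handled in this simplified version):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain k-means: alternate assignment and update steps until convergence."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # random initialization
    for _ in range(n_iters):
        # Assignment step: each point goes to its nearest centroid (squared Euclidean)
        labels = ((X[:, None, :] - centroids) ** 2).sum(axis=2).argmin(axis=1)
        # Update step: each centroid moves to the mean of its assigned points
        # (this simple sketch assumes no cluster ends up empty)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):  # convergence: centroids stopped moving
            break
        centroids = new_centroids
    labels = ((X[:, None, :] - centroids) ** 2).sum(axis=2).argmin(axis=1)
    inertia = ((X - centroids[labels]) ** 2).sum()  # within-cluster variance
    return labels, centroids, inertia

# Illustrative usage on two made-up Gaussian blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, size=(50, 2)), rng.normal(5, 1, size=(50, 2))])
labels, centroids, inertia = kmeans(X, k=2)
print(centroids, inertia)
```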
K-Means: Elbow Plot
The elbow plot is a graphical tool used to determine
the optimal number of clusters (k) in a K-Means
clustering algorithm. It helps identify the point where
increasing the number of clusters results in
diminishing returns, meaning the improvement in
clustering performance slows down, forming an
"elbow" shape in the plot.
Working:
• Perform K-Means clustering on the dataset for a range of k values (typically from 1 to a chosen upper limit).
• For each k value, compute the within-cluster variance or inertia (sum of squared distances from each point to its assigned centroid).
• Plot the value of inertia (on the y-axis) against the number of clusters k (on the x-axis).
• The "elbow" point on the curve indicates the optimal k value. After this point, adding more clusters provides only minimal reduction in inertia.

Key Terms:
• Inertia: The sum of squared distances of each data point to its nearest centroid. It indicates how tightly the clusters are packed.
• Diminishing Returns: The concept that, after a certain point, adding more clusters does not significantly improve the clustering outcome.
• Elbow Point: The k value where inertia starts to decrease at a slower rate, forming a sharp bend in the plot.
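A short sketch of the procedure, assuming scikit-learn and matplotlib with an illustrative blob dataset:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Illustrative data with 4 underlying clusters
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

ks = range(1, 10)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in ks]

plt.plot(ks, inertias, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("inertia")
plt.show()   # the sharp bend (around k = 4 here) marks the elbow point
```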
Hierarchical Clustering

Hierarchical Clustering is an unsupervised clustering algorithm that builds a hierarchy of clusters. It does not require the number of clusters K to be specified in advance. It can produce either a dendrogram (tree-like structure) or a set of nested clusters.

Types:
• Agglomerative Hierarchical Clustering: It starts with each data point as its own cluster and merges the closest pairs of clusters until the desired number of clusters is reached.
• Divisive Hierarchical Clustering: It starts with a single cluster containing all data points and recursively splits the cluster into smaller clusters until each cluster contains only one data point.
Working:
• Start with each data point as its own cluster, treating N
data points as N clusters.
• Merge the two closest clusters into a single cluster based
on a distance metric (e.g., Euclidean distance).
• Repeat the merging process until only a single cluster
remains or until a stopping criterion is met.
Hyperparameters:
• Distance Metric: The choice of distance metric can significantly affect the clustering results.
• Linkage Method: The method used to calculate the distance between clusters can impact the resulting cluster
structure.
• Stopping Criterion: Criteria such as the maximum number of clusters or a threshold distance can be used to stop
the clustering process.
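A compact sketch using SciPy's agglomerative implementation; the 'ward' linkage method, the blob data, and the 3-cluster stopping criterion are illustrative assumptions:

```python
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import make_blobs

# Made-up data with 3 underlying groups
X, _ = make_blobs(n_samples=30, centers=3, random_state=0)

# Agglomerative clustering: 'ward' linkage merges the pair of clusters
# that least increases total within-cluster variance at each step
Z = linkage(X, method="ward")                    # each row of Z records one merge

labels = fcluster(Z, t=3, criterion="maxclust")  # stopping criterion: 3 clusters
print(labels)
# scipy.cluster.hierarchy.dendrogram(Z) would draw the tree-like merge structure
```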
DBSCAN
DBSCAN is a density-based clustering algorithm used to partition a dataset into clusters of varying shapes and sizes. It does not require the number of clusters k to be specified beforehand. DBSCAN is capable of identifying noise points, which do not belong to any cluster.
Working:
• DBSCAN defines clusters as dense regions of data points separated by regions of lower density.
• It requires two parameters: ε, the maximum distance between two points to be considered neighbors, and minPts,
the minimum number of points required to form a dense region (core point).
• The algorithm starts by randomly selecting a point from the dataset. If it has at least minPts neighbors within
distance ε, it becomes a core point, and a new cluster is formed.
• The algorithm expands the cluster by adding all reachable points (including core points and border points) within
distance ε to the cluster.
• If a core point is not reachable from any existing cluster, it becomes a new cluster.
• Points that are not core points and are not reachable from any cluster are considered noise points.
Key Formulas:
• Reachability Distance: The reachability distance between two data points p and q is defined as the maximum of the core distance of q and the distance between p and q.
• Reachability distance is calculated using the formula: $\text{reach-dist}(p, q) = \max\bigl(\text{core-dist}(q),\ d(p, q)\bigr)$
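A minimal scikit-learn sketch; the ε and minPts values (eps and min_samples) are illustrative assumptions tuned for the made-up two-moons data:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-moons: non-convex clusters of varying shape
X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

# eps is the neighborhood radius ε; min_samples plays the role of minPts
db = DBSCAN(eps=0.2, min_samples=5).fit(X)

print(set(db.labels_))   # cluster ids; the label -1 marks noise points
```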
Principal Component Analysis (PCA)

Working:
• Standardize the data: Ensure that each feature (like age, income, etc.) is on the same scale to avoid larger values dominating the analysis.
• Compute the covariance matrix: Calculate the relationships between features to see how much they vary together.
• Find principal components: Use mathematical methods to find the directions (called principal components) where the data shows the most variation.
• Rank the components: The components are ranked based on how much of the total variance they capture, using eigenvalues.
• Reduce dimensions: Select the top k components that capture the most variance, and project the original data onto these components.

Key Terms:
• Principal Components: New variables created by combining the original ones, capturing the main patterns in the data.
• Covariance Matrix: A table showing how much each pair of features varies together.
• Eigenvalues: Values that indicate how much information (variance) each principal component contains.
• Eigenvectors: The directions or axes along which the principal components lie.
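A NumPy sketch that mirrors the five steps above; the correlated synthetic data is an illustrative assumption:

```python
import numpy as np

# Made-up correlated data: 100 samples, 3 features on different scales
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) @ np.array([[2.0, 0.0, 0.0],
                                          [0.5, 1.0, 0.0],
                                          [0.0, 0.0, 0.1]])

Xs = (X - X.mean(axis=0)) / X.std(axis=0)   # 1. standardize the data
cov = np.cov(Xs, rowvar=False)              # 2. covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)      # 3. eigenvectors = principal directions
order = np.argsort(eigvals)[::-1]           # 4. rank components by variance (eigenvalues)
k = 2
X_reduced = Xs @ eigvecs[:, order[:k]]      # 5. project onto the top k components
print("variance captured:", eigvals[order][:k] / eigvals.sum())
```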
EVALUATION METRICS
Regression Metrics

Basic metrics: Given a regression model f, the following metrics are commonly used to assess the performance of the model:
• Mean Absolute Error (MAE): $\frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i|$
• Mean Squared Error (MSE): $\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
• Root Mean Squared Error (RMSE): $\sqrt{\text{MSE}}$
• Coefficient of Determination ($R^2$): $1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$
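A small NumPy sketch computing these metrics on made-up predictions:

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute the basic regression metrics listed above."""
    err = y_true - y_pred
    mse = np.mean(err ** 2)
    return {
        "MAE": np.mean(np.abs(err)),   # mean absolute error
        "MSE": mse,                    # mean squared error
        "RMSE": np.sqrt(mse),          # root mean squared error
        "R2": 1 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2),
    }

# Made-up true values and predictions for illustration
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.3, 6.6, 9.4])
print(regression_metrics(y_true, y_pred))
```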
AUC: The area under the receiver operating characteristic (ROC) curve, also noted AUC or AUROC, is the area below the ROC curve.
Model Selection

When selecting a model, we distinguish 3 different parts of the data that we have as follows:

Training Set | Validation Set | Testing Set
usually 80% of the dataset | usually 20% of the dataset | unseen data

Once the model has been chosen, it is trained on the entire dataset and tested on the unseen test set.
Cross-validation, also noted CV, is a method that is used to select a model that does not rely too much on the initial training set. The different types are summed up in the table below:

k-fold | Leave-p-out
Training on k-1 folds and assessment on the remaining one | Training on n-p observations and assessment on the p remaining ones

The most commonly used method is called k-fold cross-validation and splits the training data into k folds to validate the model on one fold while training the model on the k-1 other folds, all of this k times. The error is then averaged over the k folds and is named cross-validation error.
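A brief scikit-learn sketch of 5-fold cross-validation; the model and synthetic data are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Illustrative data and model
X, y = make_classification(n_samples=300, random_state=0)

# k-fold CV: train on k-1 folds, assess on the held-out fold, repeated k times
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print("fold scores:", scores)
print("cross-validation error:", 1 - scores.mean())
```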
Bias-Variance Tradeoff

Bias: The bias of a model is the difference between the expected prediction and the correct model that we try to predict for given data points.