Machine Learning Cheatsheet Compiled and Curated by Robins Yadav
© 2024 Robins Yadav - magic starts here
My github: https://github.com/robinyUArizona
→ The training loss goes down over time, achieving low error values.
→ The validation loss goes down until a turning point is found, and then it starts going up again. That point represents the beginning of overfitting. Therefore, the training process should be stopped when the validation error trend changes from descending to ascending.
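The stopping rule described above (stop once the validation loss turns from descending to ascending) can be sketched as follows. This is a minimal illustration, not a prescribed implementation: the function name `early_stopping_epoch`, the `patience` parameter and the loss values are assumptions made for the example.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the index of the epoch with the best validation loss.

    The scan stops once the loss has failed to improve for `patience`
    consecutive epochs (a hypothetical tolerance, not from the cheatsheet).
    """
    best_epoch, best_loss, waited = 0, float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best validation loss: remember it and reset the counter.
            best_epoch, best_loss, waited = epoch, loss, 0
        else:
            waited += 1
            if waited >= patience:
                break  # validation loss keeps rising: stop training
    return best_epoch


# Hypothetical validation losses: descending, then ascending.
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61]
stop_at = early_stopping_epoch(val_losses)
```

In practice one would usually also keep a copy of the model weights from the best epoch, since training continues for a few epochs past the turning point before the rule triggers.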
Machine Learning General

Definition
We want to learn a target function f that maps input variables X to output variable y, with an error e:
$y = f(X) + e$

Discriminative vs. generative models:
Discriminative | Generative
Learns the decision boundary between classes | Learns the input distribution
Directly estimates P(y|x) | Estimates P(x|y) to find P(y|x) using Bayes' rule
Specifically meant for classification tasks | Used for generating new content or data
Logistic Regression, Random Forests, SVM, Neural Networks, Decision Tree, kNN | Hidden Markov Models, Naive Bayes, Gaussian Mixture Models, Gaussian Discriminant Analysis, LDA, Bayesian Networks

• Unsupervised learning methods learn to find the inherent structure or hidden patterns from unlabeled data X (x(1), ..., x(m)).

Linear, Non-linear
Different algorithms make different assumptions about the shape and structure of f. Any algorithm can be either:
• Parametric (or Linear): simplifies the mapping to a known linear combination form and learns its coefficients.
• Non-parametric (or Non-linear): free to learn any functional form from the training data, while maintaining some ability to generalize.
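As a small illustration of the parametric case, the sketch below assumes the target function has the known linear form y = intercept + slope * x and learns its two coefficients by least squares. The data values and the helper name `fit_line` are made up for the example.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x (a parametric model)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data: roughly y = 1 + 2x plus a small error e.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.0]
intercept, slope = fit_line(xs, ys)
```

The learned coefficients land close to the underlying 1 and 2; the remaining gap is the error term e.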
Note: Linear algorithms are usually simpler, faster and require less data, while nonlinear algorithms can be more flexible, more powerful and more performant.

Bias-Variance trade-off, Underfitting, Overfitting
→ Expected test error is the error we expect from predicting new, unobserved data points.
In supervised learning, the prediction (expected) error e is composed of the bias, the variance and the irreducible part.
$\mathrm{Error}(x) = \left(E[\hat{f}(x)] - f(x)\right)^2 + E\left[\left(\hat{f}(x) - E[\hat{f}(x)]\right)^2\right] + \sigma_e^2$
→ Bias refers to erroneous assumptions made by the model about the data to make the target function easier to learn. Mathematically, how much do predicted values differ from true values?

Supervised, Unsupervised
• Supervised learning methods learn to predict outcomes y (y(1), ..., y(m)) from data points X (x(1), ..., x(m)) given that the data is labeled.

Type of prediction
 | Regression | Classification
Outcome | Continuous | Class
Examples | Linear Regression | Logistic Regression, SVM, Naive Bayes

→ Conditional Estimates
Regression → conditional expectation: E[y|X = x]
Classification → conditional probability: P(Y = y|X = x)

Type of models
→ Discriminative Model: It focuses on predicting the labels of the data. A discriminative model is trained by learning parameters that maximize the conditional probability P(Y|X).
→ Generative Model: It focuses on learning a probability distribution for the dataset; it can reference this probability distribution to generate new data instances. A generative model learns parameters by maximizing the joint probability P(X, Y).

→ Variance is the amount by which the prediction (estimate of the target function) will change if different training data sets were used. It measures how scattered (inconsistent) the predicted values are around the correct value due to different training data sets (or possibly different random seeds). It is also known as Variance Error or Error due to Variance.
Note: As the complexity of the model rises, the variance will increase and bias will decrease.
• The goal of parameterization is to achieve a low bias and low variance trade-off through methods such as:
→ Cross-validation can be used to tune models so as to optimize the trade-off
→ Dimension reduction and feature selection
→ Mixture models (probabilistic models) and ensemble learning.
• Underfitting or High bias means that the model is not able to capture or learn the trend or pattern in data.
• Overfitting or High variance means that the model fits the available data but does not generalize well to predict on new data.

• Training loss vs. Validation loss:
→ Epochs: One Epoch is when an ENTIRE dataset is passed forward and backward through the neural network only ONCE.
→ Batch: You can't pass the entire dataset into the neural net at once. So, you divide the dataset into a number of batches (or sets or parts).
→ Iterations is the number of batches needed to complete one Epoch.
Question: If the total number of samples in a dataset is 1000 and the batch size is 10, how many iterations will there be in one epoch? Ans: 100
• How would you identify if your model is overfitting? By analyzing the learning curves, you should be able to spot whether the model is underfitting or overfitting. The y-axis is some metric of learning (e.g., classification accuracy) and the x-axis is experience (time or number of iterations).
• Regularization - Dropout: During training, randomly set some activations to 0. This forces the network not to rely on any one node.
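The relationship between epochs, batches and iterations described above (1000 samples with batch size 10 gives 100 iterations per epoch) can be checked with a short sketch. The helper names `iterations_per_epoch` and `batches` are illustrative, and the use of ceiling division for a dataset that does not divide evenly is an assumption, since the cheatsheet only covers the evenly divisible case.

```python
import math


def iterations_per_epoch(num_samples: int, batch_size: int) -> int:
    # Ceiling division: a final, smaller batch still counts as one iteration.
    return math.ceil(num_samples / batch_size)


def batches(data, batch_size):
    """Yield consecutive slices of `data`; one full pass is one epoch."""
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]


n_iter = iterations_per_epoch(1000, 10)
# Count the batches actually produced in one pass over 1000 dummy samples.
n_batches = sum(1 for _ in batches(list(range(1000)), 10))
```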
Unrepresentative Training Dataset
When the data available during training is not enough to capture the model, relative to the validation dataset.
The train and validation curves are improving, but there's a big gap between them, which means they operate like datasets from different distributions.

Unrepresentative Validation Dataset
As we can see, the training curve looks ok, but the validation function moves noisily around the training curve. It could be the case that validation data is scarce and not very representative of the training data, so the model struggles to model these examples.
Here, the validation loss is much better than the training one, which reflects that the validation dataset is easier to predict than the training dataset. An explanation could be that the validation data is scarce but widely represented by the training dataset, so the model performs extremely well on these few examples.

• What is cross-validation? Why is it important?
Cross-validation evaluates a model's performance.
→ The idea is to divide the dataset into k subsets or "folds", train the model on k − 1 of these folds, and test on the remaining fold to ensure that the model generalizes well to unseen data.
→ After evaluating on all k folds, performance metrics are averaged for a robust estimate of the model's effectiveness.
→ K-Fold Cross-Validation: The data is divided into k equally-sized folds.
→ Stratified K-Fold: Similar to K-Fold, but it maintains the proportion of classes in each fold, making it ideal for imbalanced datasets.
→ Leave-One-Out Cross-Validation (LOOCV): A special case where k equals the number of data points, so each fold contains just one data point.

Data Science General

Notes
→ Parameters: Model parameters are numeric values that are part of the model. We estimate them from the data.
→ Ablation: An ablation study turns off components of a model (e.g. features or sub-models) one at a time, to see how much each contributes to the model's performance.
→ Overfitting arises when our model learns too much, so it can't generalize to new data.
→ Apply transformations (e.g., log, square root) to reduce skewness and improve model performance, especially for models sensitive to outliers.
→ Apply transformations based on domain knowledge and understanding of the data characteristics.

Preventing Data Leakage
• Proper Data Splitting: Split data into training, validation, and test sets before performing any data preprocessing. For time series, split data chronologically.
• Transformation fit on Training Data Only: Ensure that transformations (e.g., scaling, encoding) are fit only on the training data and then applied to both training and test data.
• Feature Selection: Ensure that features used for training are available at the time of prediction. For time series data, create lag features that use only past information.
• Target Leakage: Ensure that features used in training are available at the time of prediction and do not contain information derived from the target variable.
• Data Augmentation: Apply augmentation techniques only to the training set.

Handle missing or corrupted data in a dataset
• Remove Missing Data: If only a small number of rows have missing values, or if an entire column has a large percentage of missing values.
• Impute Missing Data: Mean/Median/Mode Imputation, Imputation Using Algorithms, Forward/Backward Fill or Interpolation for time series data.
• Use Algorithms That Handle Missing Data: Decision trees, Random forests.
• Replace Corrupted Data: Identify outliers using statistical methods (like z-scores or IQR), apply manual inspection and correction of corrupted values based on external sources or domain expertise.
• Data Augmentation: To create synthetic data based on existing patterns.

Inference vs. Classification
→ Inference: Given two groups, what is the difference between these groups? t-tests, paired t-test etc.
→ Classification: Given a new animal, find whether the new animal is a cat or a dog.

Prediction and Inference
→ Prediction uses a model to predict future observations.
• Model does not need to be valid
• Evaluation does need to be valid
• Quality & strength: accuracy of predicting unseen data
→ Inference uses the model's structure and parameters to learn or understand an underlying phenomenon.
• Validity depends on assumptions
• Quality & strength: R², p-value, assumption checks, coefficients

ETL - Extract Transform Load
An ETL workflow is essential for consolidating and preparing data for analysis, reporting, and business intelligence. Organizations can ensure data accuracy, consistency, and availability by following a well-defined ETL process, enabling better decision-making and insights.

1. Extract Phase:
Goal: Retrieve raw data from various source systems.
• Identify Data Sources: Determine the databases, files, APIs, or other sources from which data will be extracted.
• Connect to Data Sources: Establish connections to these sources using appropriate connectors or drivers.
• Extract Data: Perform the actual data extraction, which may involve running SQL queries, reading files, or calling API endpoints.
• Handle Incremental Extraction: Extract only the new or updated data since the last extraction to optimize performance and reduce load.
Tools: SQL queries, Python scripts, database connectors, API clients.

2. Transform Phase:
Goal: Clean, format, and prepare the data for loading into the target system.
• Data Cleaning: Remove duplicates, handle missing values, correct errors and inconsistencies.
• Data Integration: Merge data from different sources, resolve data conflicts and redundancies.
• Data Transformation: Convert data types, normalize or denormalize data, apply business rules and calculations, aggregate data (e.g., summing, averaging).
• Data Enrichment: Add additional context or metadata, join with reference data.
• Data Validation: Check for data quality and integrity, ensure transformations have been applied correctly.
Tools: Python, SQL, transformation tools like Apache NiFi, Talend, or Apache Spark.

3. Load Phase:
Goal: Load the transformed data into the target data warehouse or database.
• Prepare Target Schema: Define the schema of the target database or data warehouse, create tables, indexes, and partitions as needed.
• Load Data: Insert transformed data into the target tables, use bulk load methods for efficiency when dealing with large datasets.
• Post-Load Validation: Verify that the data has been loaded correctly, check for any discrepancies or errors.
• Indexing and Partitioning: Optimize the target database for query performance, create necessary indexes and partitions.
Tools: Database clients, bulk load utilities, ETL tools with load functionality.

(1) Accuracy: $\frac{TP + TN}{TP + TN + FP + FN}$ → Ratio of correct predictions over total predictions. Estimate of P[D = Y], the probability that the decision is equal to the outcome.
(4) F1-Score $= 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$ → False positive (FP) and False negative (FN) are equally important.
Ex: We assume the null hypothesis H0 is true.
→ H0: Person is not guilty
→ H1: Person is guilty
(5) False Positive Rate: $\frac{FP}{TN + FP}$ - Fraction of negatives wrongly classified positive. Probability: P[D = 1|Y = 0]
(6) False Negative Rate: $\frac{FN}{TP + FN}$ = 1 − Recall - Fraction of positives wrongly classified negative. Probability: P[D = 0|Y = 1]
(7) Specificity: $\frac{TN}{TN + FP}$ = 1 − FPR - Fraction of negatives rightly classified negative. Probability: P[D = 0|Y = 0]
(8) "Fraudulent transaction detector": FPR $= \frac{FP}{FP + TN}$ → probability of falsely rejecting the "Null Hypothesis" H0
(9) ROC-curve: What FPR must you tolerate for a certain TPR? An ROC curve plots TPR vs. FPR at different classification thresholds α.
→ Lowering the classification threshold classifies more items as positive, thus increasing both False Positives and True Positives.

How to choose the threshold for logistic regression? The choice of a threshold depends on the relative importance of TPR and FPR for the classification problem. For example: Suppose you are building a model to predict customer churn. False negatives (not identifying customers who will churn) might lead to loss of revenue, making TPR crucial. In contrast, falsely predicting churn (false positives) could lead to unnecessary retention efforts, making FPR important. If there is no external concern about low TPR or high FPR, one option is to weight them equally by choosing the threshold that maximizes TPR − FPR.
Precision-Recall curve: Focuses on the correct prediction of the minority class, useful when data is imbalanced. Plot precision against recall at different thresholds.
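The rates listed above can be computed directly from the four confusion-matrix counts. The sketch below is illustrative: the function name and the counts are made up, and precision and recall use their standard definitions (TP / (TP + FP) and TP / (TP + FN)), which do not appear in the surviving text of this section.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute common classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)   # standard definition (assumed)
    recall = tp / (tp + fn)      # a.k.a. TPR, standard definition (assumed)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "fpr": fp / (tn + fp),          # negatives wrongly classified positive
        "fnr": fn / (tp + fn),          # positives wrongly classified negative
        "specificity": tn / (tn + fp),  # negatives rightly classified negative
    }


# Hypothetical counts for a binary classifier evaluated on 100 examples.
m = classification_metrics(tp=40, tn=45, fp=5, fn=10)
```

The identities FNR = 1 − Recall and Specificity = 1 − FPR from the list above hold for any choice of counts, which makes them a convenient sanity check.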
Regression Problems
1. Mean Squared Error: $MSE = \frac{1}{n}\sum_i (y_i - \hat{y}_i)^2$
2. Root Mean Squared Error: $RMSE = \sqrt{\frac{\sum_{i=1}^{N} (\hat{y}_i - y_i)^2}{N}}$
3. Mean Absolute Error: $MAE = \frac{1}{n}\sum_i |y_i - \hat{y}_i|$
4. Sum of Squared Error: $SSE = \sum_i (y_i - \hat{y}_i)^2$
5. Total Sum of Squares: $SST = \sum_i (y_i - \bar{y})^2$
6. R² Error: $R^2 = 1 - \frac{MSE(\text{model})}{MSE(\text{baseline})}$, equivalently $R^2 = 1 - \frac{SSE}{SST}$
7. Adjusted R²: $R_a^2 = 1 - \frac{n-1}{n-k-1}(1 - R^2)$

Variance, R² and the Sum of Squares
• The total sum of squares: $SS_{total} = \sum_i (y_i - \bar{y})^2$
• This scales with variance: $var(Y) = \frac{1}{n}\sum_i (y_i - \bar{y})^2$
• The regression sum of squares: $SS_{reg} = \sum_i (\hat{y}_i - \bar{y})^2$ → n Var(predictions)
• The residual sum of squares (squared error): $SS_{resid} = \sum_i (y_i - \hat{y}_i)^2$ → n Var(ϵ)
Note: $\bar{\epsilon} = 0$, $E[\hat{y}] = \bar{y}$
$SS_{total} = SS_{reg} + SS_{resid}$
$R^2 = 1 - \frac{SS_{resid}}{SS_{total}} = \frac{SS_{reg}}{SS_{total}} = \frac{n\,Var(Preds)}{n\,Var(Y)} = \frac{Var(Preds)}{Var(Y)}$
→ Explained Variance: R² quantifies how much of the variability in the outcome (dependent) variable can be explained by the predictor (independent) variables.
→ An R² of 1 indicates a perfect fit, where the model explains all the variability, while an R² of 0 indicates that the model explains none of the variability.
→ Goodness of Fit: A higher R² value generally suggests a better fit of the model to the data, meaning the model's predictions are closer to the actual values.
→ R² is not valid for nonlinear models, as $SS_{reg} + SS_{resid} \neq SS_{total}$
→ Drawback: R-squared never decreases (and typically increases) when a new predictor variable is added to the regression model.

Optimization
Almost every machine learning method has an optimization algorithm at its core.
→ Hypothesis: The hypothesis is noted $h_\theta$ and is the model that we choose. For a given input data $x^{(i)}$ the model prediction output is $h_\theta(x^{(i)})$.
→ Loss function: $L: (\hat{y}, y) \in \mathbb{R} \times Y \mapsto L(\hat{y}, y) \in \mathbb{R}$ takes as inputs the predicted value ŷ corresponding to the real data value y and outputs how different they are. The loss function computes the distance or difference between the current output ŷ of the algorithm and the expected output y. The common loss functions are summed up in the table below:
Least squared error | Logistic loss | Hinge loss
$\frac{1}{2}(y - \hat{y})^2$ | $\log(1 + \exp(-y\hat{y}))$ | $\max(0, 1 - y\hat{y})$
Linear Regression | Logistic Regression | SVM
→ Cost function: The cost function J is commonly used to know the performance of a model, and is defined with the loss function L as follows:
$J(\theta) = \sum_{i=1}^{m} L(h_\theta(x^{(i)}), y^{(i)})$

Convex & Non-convex
A convex function is one where a line drawn between any two points on the graph lies on or above the graph. It has one minimum. A non-convex function is one where a line drawn between any two points on the graph may intersect other points on the graph. It is characterized as "wavy".
→ When a cost function is non-convex, there is a chance that the optimizer finds a local minimum instead of the global minimum, which is typically undesired in machine learning models from an optimization perspective.

Gradient Descent
Gradient Descent is used to find the coefficients of f that minimize a cost function (for example MSE, SSR).
→ Time Complexity: O(kn²) → n is the number of data points.
→ It minimizes the average loss by moving iteratively in the direction of steepest descent, controlled by the learning rate α (step size). Note, α can be updated adaptively for better performance.
Procedure:
1. Initialization θ = 0 (coefficients to 0 or random)
2. Calculate cost J(θ) = evaluate f(coefficients)
3. Gradient of cost $\frac{\partial}{\partial \theta_j} J(\theta)$: tells us the uphill direction
4. Update coefficients $\theta_j = \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$: we go downhill
The update process is repeated until convergence (minimum found).
Tips:
• Change learning rate α ("size of jump" at each iteration)
• Plot Cost vs. Time to assess learning rate performance.
• Rescale the input variables
• Reduce passes through the training set with SGD
• Average over 10 or more updates to observe the learning trend while using SGD
• Stochastic Gradient Descent applies the parameter update for each observation.
→ Time Complexity: O(km²) → m is the sample of data selected randomly from the entire data of size n
→ It only uses a single point to compute gradients, leading to noisier convergence but faster compute speeds.
• Mini-batch Gradient Descent trains on small subsets of the data, striking a balance between the approaches.

Ordinary Least Squares

Least Squares Regression
We fit linear models:
$\hat{y} = \beta_0 + \sum_j \beta_j x_j$
Here, $\beta_j$ is the j-th coefficient and $x_j$ is the j-th feature.
Ordinary Least Squares - find $\vec{\beta}$ that minimizes squared error:
$\arg\min_{\vec{\beta}} \sum_i (y_i - \hat{y}_i)^2$
Goal: least-squares solution to $X\vec{\beta} = y$
Solution: solve the normal equations:
$X^T X \vec{\beta} = X^T y \;\rightarrow\; \vec{\beta} = (X^T X)^{-1} X^T y$
$L(\vec{\beta} \mid X, \hat{y})$

General Optimization:
1. Understand data (features and outcome variables)
2. Define loss (or gain/utility) function
3. Define predictive model
4. Search for parameters that minimize loss function

Augmented Loss
We can add more things to the loss function
• Penalize model complexity
• Penalize "strong" beliefs
– Requires predictive utility to overcome them
→ Least squares generalizes into minimizing loss functions.
→ This is the heart of machine learning, particularly supervised learning.

Maximum Likelihood Estimation MLE
Any probability distribution has parameters, so fitting parameters is an extremely crucial part of data analysis. There are two general methods for doing so. In maximum likelihood estimation (MLE), the goal is to estimate the most likely parameters given a likelihood function:
$\theta_{MLE} = \arg\max L(\theta)$, where $L(\theta) = f_n(x_1, \ldots, x_n \mid \theta)$
Since the values of X are assumed to be i.i.d., the likelihood function becomes the following:
$L(\theta) = \prod_{i=1}^{n} f(x_i \mid \theta)$
The natural log of L(θ) is then taken prior to calculating the maximum; since log is a monotonically increasing function, maximizing the log-likelihood log L(θ) is equivalent to maximizing the likelihood:
$\log L(\theta) = \sum_{i=1}^{n} \log f(x_i \mid \theta)$
→ MLE is used to find the estimators that maximize the likelihood function: $L(\theta \mid x) = f_\theta(x)$, the density function of the data distribution

Maximum a Posterior MAP
Another way of fitting parameters is through maximum a posterior estimation (MAP), which assumes a "prior distribution":
$\theta_{MAP} = \arg\max g(\theta) f(x_1, \ldots, x_n \mid \theta)$
where the similar log-likelihood is again employed, and g(θ) is a density function of θ.

Log Likelihood
Logistic Regression:
$P(Y = 1 \mid X = x) = \hat{y} = \mathrm{logistic}\left(\beta_0 + \sum_j \beta_j x_j\right)$
The model computes the probability of yes.
Probability of Observed
What if we want P(Y = y_i), regardless of whether y_i is 1 or 0?
$P(Y = y_i \mid X = x_i) = \hat{y}_i^{y_i} (1 - \hat{y}_i)^{1 - y_i}$
• $\hat{y}_i$ is the model's estimate of P(Y = 1|X = x_i)
• $y_i \in \{0, 1\}$ is the outcome
• $\hat{y}_i^{y_i}$ is $\hat{y}_i$ if $y_i = 1$, and 1 if $y_i = 0$ (the multiplicative identity)

Conditioning on Parameters
Fuller definition - condition on parameters $\vec{\beta}$ and write function:
$P(Y = 1 \mid x, \vec{\beta}) = \hat{y} = m(x, \vec{\beta}) = \mathrm{logistic}(\ldots)$

Likelihood Function
Given data y = ⟨y1, ..., yn⟩, x = ⟨x1, ..., xn⟩ and parameters $\vec{\beta}$:
$\mathrm{Likelihood}(y, x, \vec{\beta}) = P(y, x \mid \vec{\beta}) \propto P(y \mid x, \vec{\beta}) = \prod_i P(y_i \mid x_i, \vec{\beta})$
This is weird:
$P(y, x \mid \vec{\beta}) \propto P(y \mid x, \vec{\beta})$
$P(y, x \mid \vec{\beta}) = P(y \mid x, \vec{\beta}) P(x \mid \vec{\beta})$
But x is independent of params, so $P(x \mid \vec{\beta}) = P(x)$. And x is fixed, so P(x) is an (unknown) constant.
$\mathrm{LogLik}(y, x, \vec{\beta}) = \log \mathrm{Likelihood}(y, x, \vec{\beta})$
$= \log\left[P(x) \prod_i P(y_i \mid x_i, \vec{\beta})\right]$
$= \log P(x) + \sum_i \log P(y_i \mid x_i, \vec{\beta})$
Maximum Likelihood Estimator
$\arg\max_{\vec{\beta}} \sum_i \log P(y_i \mid x_i, \vec{\beta})$
$P(Y = y_i \mid X = x_i) = \hat{y}_i^{y_i} (1 - \hat{y}_i)^{1 - y_i}$
$\log P(Y = y_i \mid X = x_i) = y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i)$
Model log likelihood is a sum over the training data. Applicable to any model where ŷ = P(Y = 1|x)

Linear Algorithms

Regression
→ Regression predicts (or estimates) a continuous variable
Dependent variable Y, independent variable(s) X
→ compute estimate ŷ ≈ y
$\hat{y}_i = \beta_0 + \beta_1 x_i$
$y_i = \hat{y}_i + \epsilon_i$
Here, β0 is the intercept, β1 is the slope and ϵ is the residual. The goal is to learn β0, β1 to minimize $\sum_i \epsilon_i^2$ (least squares)
Linearity: A linear equation of k + 1 variables is of the form:
$\hat{y} = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$
It is the sum of scalar multiples of the individual variables - a line!
→ Linear models are remarkably capable of transforming many non-linear problems into linear.

Linear Regression
$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \epsilon_i$
$\hat{y}_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij}, \quad i = 1, \ldots, n$
Here, n is the total number of observations, y_i is the dependent variable, and x_ij is the explanatory variable for the j-th feature of the i-th observation. β0 is the intercept, usually called the bias coefficient.
Assumptions:
→ Linear models make four key assumptions necessary for inferential validity.
• Linearity: outcome y and predictor X have a linear relationship.
• Independence: observations are independent of each other
- Independent variables (features) are not highly correlated with each other → Low multicollinearity
• Normal errors: residuals are normally distributed - check with Q-Q plots. Violation means the line still fits but p-values and CIs are unreliable
• Equal variance: residuals have constant variance (called homoskedasticity; violation is heteroskedasticity) - check a scatterplot or regplot of residuals vs. fitted values. Violation means the model is failing to capture a systematic effect.
→ These violations are a problem only for inference, not for prediction
Variance Inflation Factor: Measures the severity of multicollinearity → $\frac{1}{1 - R_i^2}$, where $R_i^2$ is found by regressing $X_i$ against all other variables (a common VIF cutoff is 10)
Learning: Estimating the coefficients β from the training data using the optimization algorithm Gradient Descent or Ordinary Least Squares.
Ordinary Least Squares - where we find $\vec{\beta}$ that minimizes squared error:
$\arg\min_{\vec{\beta}} \sum_i (y_i - \hat{y}_i)^2$
→ The dimension of the hyperplane of the regression is its complexity.
Variations: There are extensions of Linear Regression training called regularization methods, that aim to reduce the complexity of the
models or to address over-fitting in ML. The regularizer is not dependent on the data.
→ In relation to the bias-variance trade-off, regularization aims to decrease complexity in a way that significantly reduces variance while only slightly increasing bias.
→ Standardize numeric variables when using regularization to ensure that 0 is a neutral value, so a low coefficient means "little effect when deviating from average". Values, and therefore coefficients, are then on the same scale (# of standard deviations), which properly distributes weight between them.
→ Multicollinearity → correlated predictors. Problem: Which coefficient gets the common effect? This is where loss and regularization come in.
• Ridge Regression (L2 regularization): where OLS is modified to also minimize the sum of the squared coefficients
$\sum_{i=1}^{n} \left(y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij}\right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 = RSS + \lambda \sum_{j=1}^{p} \beta_j^2$
→ Prevents the weights from getting too large (L2 norm). If lambda is very large then it will add too much weight and it will lead to under-fit.
$\lambda \propto \frac{1}{\text{model variance}}$
• Lasso Regression (L1 regularization): where OLS is modified to also minimize the sum of the absolute values of the coefficients
$\sum_{i=1}^{n} \left(y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij}\right)^2 + \lambda \sum_{j=1}^{p} |\beta_j| = RSS + \lambda \sum_{j=1}^{p} |\beta_j|$
where p is the number of features (or dimensions), and λ ≥ 0 is a tuning parameter to be determined.
→ Lasso shrinks the less important features' coefficients to zero, thus removing some features altogether. If lambda is very large, it will make the coefficients zero and hence the model will under-fit.
→ L2 is less likely to shrink coefficients to 0. Therefore L1 regularization leads to sparser models.
Data preparation:
- Transform data for linear relationship (ex: log transform for exponential relationship)
- Remove noise such as outliers
- Rescale inputs using standardization or normalization
Advantages:
+ Good regression baseline considering simplicity
+ Lasso/Ridge can be used to avoid overfitting
+ Lasso/Ridge permit feature selection in case of collinearity
Usecase examples:
- Product sales prediction according to prices or promotions
- Call-center waiting-time prediction according to the number of complaints and the number of working agents

Logistic Regression

Log-Odds and Logistics
• Odds
The probability of success P(S): 0 ≤ p ≤ 1
→ The odds of success are defined as the ratio of the probability of success over the probability of failure.
The odds of success: $\mathrm{Odds}(S) = \frac{P(S)}{P(S^c)} = \frac{P(S)}{1 - P(S)}$
→ Ex: Odds(failure) = x → means x:1 against success
• Log Odds or logit
$\log \mathrm{Odds}(A) = \log \frac{P(A)}{1 - P(A)} = \log P(A) - \log(1 - P(A))$
• Logistic: The inverse of the logit ($\mathrm{logit}^{-1}$):
$\mathrm{logistic}(x) = \frac{1}{1 + e^{-x}} = \frac{e^x}{e^x + 1}$
[Figure: sigmoid or logistic curve]
→ Odds are another way of representing probabilities.
→ The logistic and logit functions convert between probabilities and log-odds.
• Generalized Linear Models (GLMs):
$\hat{y}_i = g^{-1}(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip})$
$\hat{y}_i = g^{-1}\left(\beta_0 + \sum_{j=1}^{p} \beta_j x_{ij}\right)$
Here, g is a link function
• Counts: Poisson regression, log link func
• Binary: Logistic regression, logit link func and $g^{-1}$ is logistic func
→ In logistic regression, a linear output is converted into a probability between 0 and 1 using the sigmoid or logistic function.
$P(y_i = 1 \mid X) = \hat{y}_i = \mathrm{logistic}\left(\beta_0 + \sum_j \beta_j x_{ij}\right)$
The representation below is an equation with binary output, which actually models the probability of the default class:
$p(X) = \frac{e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_i x_i}}{1 + e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_i x_i}} = p(y = 1 \mid X)$
Note: Coefficients are linearly related to the log-odds, such that a one unit increase in $x_1$ multiplies the odds by $e^{\beta_1}$.
Note: The coefficients in logistic regression are interpreted in terms of their effect on the log-odds of the outcome, and the exponentiated coefficients (odds ratios) provide a clearer understanding of the change in odds associated with each predictor.
Assumptions:
- Linear relationship between X and log-odds of Y
- Observations must be independent of each other
- Low multicollinearity
Learning: Learning the logistic regression coefficients is done by:
→ Minimizing the logistic loss function
$\arg\min_{\vec{\beta}} \sum_i \log\left(1 + \exp(-y_i \vec{\beta} x_i)\right)$
→ Maximizing the log likelihood of the training data given the model
$\arg\max_{\vec{\beta}} \sum_i \log P(y_i \mid x_i, \vec{\beta})$
$P(Y = y_i \mid X = x_i) = \hat{y}_i^{y_i} (1 - \hat{y}_i)^{1 - y_i}$
$\log P(Y = y_i \mid X = x_i) = y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i)$
Model log likelihood is a sum over the training data. Applicable to any model where ŷ = P(Y = 1|x)
Data preparation:
- Probability transformation to binary for classification
- Remove noise such as outliers
Advantages:
+ Good classification baseline considering simplicity
+ Possibility to change cutoff for precision/recall tradeoff
+ Robust to noise/overfitting with L1/L2 regularization
+ Probability output can be used for ranking
Usecase examples:
- Customer scoring with probability of purchase
- Classification of loan defaults according to profile

Linear Discriminant Analysis
For multiclass classification, LDA is the preferred linear technique.
Representation: LDA representation consists of statistical properties calculated for each class: means and the covariance matrix:
$\mu_k = \frac{1}{n_k} \sum_{i=1}^{n} x_i \qquad \sigma^2 = \frac{1}{n - k} \sum_{i=1}^{n} (x_i - \mu_k)^2$
LDA assumes Gaussian data and attributes of same σ². Predictions are made using Bayes Theorem:
$P(y = k \mid X = x) = \frac{P(k) \times P(x \mid k)}{\sum_{l=1}^{K} P(l) \times P(x \mid l)}$
to obtain a discriminant function (latent variable) for each class k, estimating P(x|k) with a Gaussian distribution:
$D_k(x) = x \times \frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \ln(P(k))$
The class with the largest discriminant value is the output class.
Variations:
1. Quadratic DA: Each class uses its own variance estimate
2. Regularized DA: Regularization into the variance estimate.
Data preparation:
- Review and modify univariate distributions to be Gaussian
- Standardize data to µ = 0, σ = 1 to have same variance
- Remove noise such as outliers
Advantages:
+ Can be used for dimensionality reduction by keeping the latent variables as new variables
Usecase example:
- Prediction of customer churn

Nonlinear Algorithms
All Nonlinear Algorithms are non-parametric and more flexible. They are not sensitive to outliers and do not require any shape of distribution.

Naive Bayes Classifier
Naive Bayes is a classification algorithm interested in selecting the best hypothesis h given data d, assuming that the features of each data point are all independent.
Representation: The representation is based on Bayes Theorem:
$P(Y \mid d) = \frac{P(d \mid Y) \times P(Y)}{P(d)}$
With the naive hypothesis,
$P(d \mid Y) = P(x_1, x_2, \cdots, x_n \mid Y) = P(x_1 \mid Y) \times P(x_2 \mid Y) \times \cdots \times P(x_n \mid Y)$
$P(d \mid Y) = \prod_{i=1}^{n} P(x_i \mid Y)$
The prediction is the maximum a posterior hypothesis:
$\max(P(Y \mid d)) = \max(P(d \mid Y) \times P(Y))$
Here, the denominator is not kept as it is only for normalization.
Learning: Training is fast because only probabilities need to be calculated:
$P(Y) = \frac{\text{instances}_Y}{\text{all instances}} \qquad P(x \mid Y) = \frac{\text{count}(x \wedge Y)}{\text{instances}_Y}$
Variations: Gaussian Naive Bayes can extend to numerical attributes
$f(x \mid \mu(x), \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}$
$\mu(x) = \frac{1}{n} \sum_{i=1}^{n} x_i \qquad \sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \mu(x))^2}$
Data preparation:
- Change numerical inputs to categorical (binning) or near Gaussian inputs (remove outliers, log & boxcox transform)
- Other distributions can be used instead of Gaussian
- Log-transform of the probabilities can avoid numerical underflow
- Probabilities can be updated as data becomes available
Advantages:
+ Fast because of the calculations
+ If the naive assumption holds, it can converge quicker than other models. Can be used on smaller training data.
+ Good for few categories variables
Usecase examples:
- Article classification using binary word presence
- Email spam detection using a similar technique

Likelihood and Posterior
$P(\theta \mid y) = \frac{P(y \mid \theta) P(\theta)}{P(y)}$
• P(θ) is the prior
• P(y|θ) is the likelihood: how likely is the data given params θ
• $P(y) = \int P(y \mid \theta) P(\theta)\, d\theta$ is a scaling factor (constant for fixed y)
• P(θ|y) is the posterior.
• We're maximizing likelihood (ML estimator)
• Can also maximize posterior (MAP estimator)
• When prior is constant, they're the same
• With lots of data, they're almost the same

Support Vector Machines
SVM is a go-to for high performance with little tuning. Compares extreme values in your dataset.
In SVM, a hyperplane (or decision boundary: $w^T x - b = 0$) is selected to separate the points in the input variables space by their class, with the largest margin. The closest datapoints (defining the margin) are called the support vectors.
→ The goal of a support vector machine is to find the optimal separating hyperplane which maximizes the margin of the training data.
The prediction function is the signed distance of the new input x to the separating hyperplane w, with b the bias:
$f(x) = \langle w, x \rangle - b = w^T x - b$
→ Optimal margin classifier: The optimal margin classifier h is such that:
$h(x) = \mathrm{sign}(w^T x - b)$
where $(w, b) \in \mathbb{R}^n \times \mathbb{R}$ is the solution of the following optimization problem:
$\min \frac{1}{2} \lVert w \rVert^2$ such that $y^{(i)}(w^T x^{(i)} - b) \geq 1$
Learning:
→ Hinge loss: The hinge loss is used in the setting of SVMs and is defined as follows:
$L(\hat{y}, y) = [1 - y\hat{y}]_+ = \max(0, 1 - y\hat{y})$
→ Lagrangian: We define the Lagrangian $\mathcal{L}(w, b)$ as follows:
$\mathcal{L}(w, b) = f(w) + \sum_{i=1}^{l} \beta_i h_i(w)$
The Lagrange method is required to convert a constrained optimization problem into an unconstrained optimization problem. The goal of the above equation is to get the optimal values for w and b.
$\lambda \lVert \vec{w} \rVert^2 + \left[\frac{1}{n} \sum_{i=1}^{n} \max\left(0, 1 - y_i(\vec{w} \cdot \vec{x}_i - b)\right)\right]$
The first term is the regularization term, which is a technique to avoid overfitting by penalizing large coefficients in the solution vector. The second term, hinge loss, is to penalize misclassifications. It measures the error due to misclassification (or data points being closer to the classification boundary than the margin). The λ is the regularization coefficient, and its major role is to determine the trade-off between increasing the margin size and ensuring that each $x_i$ lies on the correct side of the margin.
→ Kernel: A kernel is a way of computing the dot product of two vectors x and y in some (possibly very high dimensional) feature space, which is why kernel functions are sometimes called "generalized dot product". The kernel trick is a method of using a linear classifier to solve a non-linear problem by transforming linearly inseparable data to linearly separable ones in a higher dimension.
Given a feature mapping ϕ, we define the kernel K as follows:
$K(x, z) = \phi(x)^T \phi(z)$
by assuming a Gaussian distribution. Instead of P (x|h) are calculated ∥x − z∥2
with P (h) during learning, and MAP for prediction is calculated In practice, the kernel K defined by K(x, z) = e− is
2σ 2
using Gaussian PDF called the Gaussian kernel and is commonly used.
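The SVM quantities above (Gaussian kernel, hinge loss, and the regularized hinge objective) can be sketched in a few lines of plain Python. This is a minimal illustration, not a trainable SVM: the function names, the toy data `X_toy`/`y_toy`, and the values of `w_toy`, `b_toy`, and `lam` are assumptions chosen for the example.

```python
import math

def gaussian_kernel(x, z, sigma=1.0):
    """Gaussian kernel K(x, z) = exp(-||x - z||^2 / (2 * sigma^2))."""
    squared_norm = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-squared_norm / (2.0 * sigma ** 2))

def hinge_loss(y, score):
    """Hinge loss max(0, 1 - y * score) for a label y in {-1, +1}."""
    return max(0.0, 1.0 - y * score)

def svm_objective(w, b, X, y, lam):
    """Regularized objective: lam * ||w||^2 + mean hinge loss over the samples."""
    regularization = lam * sum(wj * wj for wj in w)
    losses = []
    for xi, yi in zip(X, y):
        score = sum(wj * xj for wj, xj in zip(w, xi)) - b  # w . x - b
        losses.append(hinge_loss(yi, score))
    return regularization + sum(losses) / len(losses)

# Hypothetical toy data: the third point is on the correct side
# but inside the margin, so it still contributes hinge loss.
X_toy = [(2.0, 0.0), (-2.0, 0.0), (0.5, 0.0)]
y_toy = [1, -1, 1]
w_toy, b_toy = (1.0, 0.0), 0.0
objective = svm_objective(w_toy, b_toy, X_toy, y_toy, lam=0.1)
```

Points with `y * score >= 1` contribute nothing to the second term, which is why only points on or inside the margin shape the solution.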
Note: we say that we use the "kernel trick" to compute the cost function using the kernel because we actually don't need to know the explicit mapping ϕ, which is often very complicated. Instead, only the values K(x, z) are needed.
Variations:
SVM is implemented using various kernels, which define the similarity measure between new data and support vectors:
1. Linear (dot-product): $K(x, x_i) = \sum (x \times x_i)$
2. Polynomial: $K(x, x_i) = \left(1 + \sum (x \times x_i)\right)^d$
3. Radial: $K(x, x_i) = \exp\left(-\gamma \sum (x - x_i)^2\right)$
Data preparation:
- SVM assumes numeric inputs, may require dummy transformation of categorical features
Advantages:
+ Allows nonlinear separation with nonlinear kernels
+ Works well in high dimensional space
+ Robust to multicollinearity and overfitting
Usecase examples:
- Face detection from images
- Target Audience Classification from tweets
Hyperparameters: regularization parameter (C) and the kernel parameters (such as gamma for the RBF kernel).

K-Nearest Neighbors

If you are similar to your neighbors, you are one of them. KNN uses the entire training data, no training is required.
Note: Higher k → higher bias, lower k → higher variance.
• Choice of k is very critical
→ A small value of k means that noise will have a higher influence on the result.
→ A large value of k makes everything classified as the most probable class and is also computationally expensive.
→ A simple approach to select k is to set $k = \sqrt{n}$, or to cross-validate on a small subset of the training data (validation data) by varying the value of k and observing the training and validation errors.
→ Minkowski Distance = $\left( \sum |a_i - b_i|^p \right)^{1/p}$
- p=1 gives Manhattan distance $\sum |a_i - b_i|$
- p=2 gives Euclidean distance $\sqrt{\sum (a_i - b_i)^2}$
→ Hamming Distance - count of the differences between two vectors, often used to compare categorical variables.
Time complexity: For N points, the distance calculation step requires quadratic time, O(N^2), and sorting the N calculated distances for one point requires O(N log N) time. Sorting for all N points makes the total process O(N^2 log N).
Space complexity: Since all the pairwise distances are stored and sorted in memory on one machine, memory is also a problem. Usually, local machines will crash if we have very large datasets.
Data preparation:
- Rescale inputs using standardization or normalization
- Address missing data for distance calculations
- Dimensionality reduction or feature selection for the curse of dimensionality (COD)
Advantages:
+ Effective if the training data is large
+ No learning phase
+ Robust to noisy data, no need to filter outliers
Usecase examples:
- Recommending products based on similar customers
- Anomaly detection in customer behavior

Classification and Regression Trees (CART)

Decision Tree is a Supervised learning technique that can be used for both Classification and Regression problems.
Here, each node represents a question about the data, and the branches from each node represent the possible answers.
→ Root Node: It is the very first node (parent node), denotes the whole population, and gets split into two or more Decision nodes based on the feature values.
→ Decision Node: At each decision node, the algorithm chooses the best feature and threshold to split the data, aiming to create the most homogeneous subsets. Decision nodes have multiple branches.
→ This process continues until a stopping condition is met (like maximum depth or pure leaves).
→ Leaf Node: The final predictions are made at the leaf nodes, which represent the outcome of those decisions.
→ Sub-Tree (branch): A subdivision of a complete tree.
At each leaf node, CART predicts the most frequent category, assuming false negative and false positive costs are the same.
→ The splitting process handles multicollinearity and outliers.
→ Trees are prone to high variance, so tune through CV.
Note: In decision trees, the depth of the tree determines the variance. Decision trees are commonly pruned to control variance.
• CART for regression minimizes SSE by splitting data into sub-regions and predicting the average value at leaf nodes. The complexity parameter cp only keeps splits that reduce loss by at least cp (small cp → deep tree).
• CART for classification minimizes the sum of region impurity, where $p_i$ is the probability of a sample being in category i. Possible measures (for two classes, Gini impurity peaks at 0.5 and base-2 cross entropy at 1):
- Gini Impurity / Gini Index / Gini Coefficient = $1 - \sum (p_i)^2$
- Cross Entropy = $-\sum p_i \log_2 (p_i)$
Procedure:
1. Calculate entropy of the outcome classes (c)
$E(T) = \sum_{i=1}^{c} -p_i \log_2 p_i$
2. The dataset is split on the different attributes. The entropy of each branch is calculated. Then it is added proportionally to get the total entropy for the split. The resulting entropy is subtracted from the entropy before the split.
$Gain(T, X) = Entropy(T) - Entropy(T, X)$
3. Choose the attribute with the largest information gain as the decision node, divide the dataset by its branches and repeat the same process on every branch.
4. A branch with entropy of 0 is a leaf node.
5. A branch with entropy more than 0 needs further splitting.
6. The ID3 algorithm is run recursively on the non-leaf branches, until all data is classified.
Advantages:
→ Can take any type of variables and does not require any data preparation
→ Simple to understand, interpret, visualize
→ Non-linear parameters don't affect its performance
Disadvantages:
→ Overfitting (high variance) occurs when the data is noisy
→ DT can be unstable (use bagging or boosting) because small variations in the data can change the tree
Hyperparameters: The most common stopping criteria for splitting are a minimum number of training observations per node and a maximum depth of the tree.

Ensemble Algorithms

Ensemble methods combine multiple, simpler algorithms (weak learners) to obtain a better performing algorithm.
Bagging          Boosting
Random Forest    AdaBoost, Gradient Boosting, XGBoost
• Bagging
→ It involves parallel training of multiple models independently on different subsets of the data. These subsets of data are drawn using the bootstrap technique.
→ Their predictions are then averaged (for regression) or combined by majority voting (for classification).
→ It can reduce the variance and prevent overfitting by averaging out the errors of individual models.
→ Bootstrapping is drawing random sub-samples (sampling with replacement) from a large sample (available data) to estimate a quantity (parameter) of an unknown population by averaging the estimates from these sub-samples.
Random Forest
→ Bagged Decision Trees: Each DT may contain a different number of rows and a different number of features.
→ Individual DTs may face overfitting, i.e. have low bias (complex model) but high variance. By ensembling a lot of DTs we reduce the variance while not increasing the bias.
Hyperparameters: number of trees, maximum depth of the trees
• Boosting
→ It involves sequential training of multiple models, where each model tries to correct the errors of the previous ones.
AdaBoost
• Uses the same training samples at each stage
• "Weakness" = Misclassified data points
• Learning Focus: Primarily reduces bias by focusing on misclassified instances
Algorithm:
1. Initialize Weights: Assign equal weight to each of the training data points
2. Train weak model and Evaluate: Provide the weighted data as input to the weak model and identify the wrongly classified data points
3. Adjust Weights: Increase the weight of wrongly classified data points
4. Combined Models: Combine the weak models using a weighted sum, where weights are based on the accuracy of each learner.
5. Repeat steps 2-4 for a predefined number of iterations or until the error is minimized.
• Limitations: Sensitive to noisy data and outliers since misclassified points are given more focus.
Hyperparameters: number of estimators, learning rate.
Gradient Boosting
• Uses different training samples at each stage
• "Weakness" = Residuals or Errors
• Learning Focus: Instead of adjusting weights, it optimizes the model by minimizing a loss function (e.g., mean squared error for regression).
Algorithm:
1. Initialize Model: Start with an initial model (e.g., a constant value, such as the average of the target).
2. Compute Residuals: Calculate the residuals (errors) of the current model.
3. Train Weak Learner: Train a weak learner on the residuals.
4. Update Model: Add the weak learner to the model with a certain learning rate.
5. Repeat steps 2-4 for a fixed number of iterations or until the model converges.
• Limitations: Slower to train and more prone to overfitting without careful tuning.
Hyperparameters: learning rate, number of boosting stages, maximum depth of individual trees.
XGBoost
• Enhances gradient boosting by making it faster, more efficient, and more accurate.
→ Execution speed: Parallelization (it will use all cores of the CPU), cache optimization, out-of-core computation (data size bigger than memory)
→ Model performance:
- Adds regularization to balance the trade-off between fitting the training data and maintaining model simplicity.
- Auto pruning: Prevents trees from growing too large, improving generalization and reducing the risk of overfitting.
- During training, the model learns the optimal way to split data with missing values, learns from the patterns of missing data, and adjusts the decision boundaries accordingly.
- Efficient handling of sparse data
- Flexible: Supports a variety of loss functions and custom objective functions
Hyperparameters: Learning Rate, Number of Trees, Maximum Depth, Min Child Weight, Subsample, Booster Type, Early Stopping Rounds

Unsupervised Machine Learning

1. Clustering
2. Dimension Reduction
3. Association Rule Mining
4. Graphical Modelling and Network Analysis

Clustering

Grouping objects into meaningful subsets, or clusters.
→ Objects within each cluster are similar.
Clustering Algorithms:
1. Partition-based methods
(a) K-means clustering
(b) Fuzzy C-Means
2. Hierarchical methods
(a) Agglomerative Clustering
(b) Divisive Clustering
3. Density-based methods
(a) Density-Based methods (DBSCAN)

K-means clustering

The objective of K-means clustering is to minimize the total intra-cluster variance, or the squared error function.
Objective function → $J = \sum_{j=1}^{K} \sum_{i=1}^{n} \|X_i^{(j)} - C_j\|^2$
Here, K is the number of clusters, n is the number of cases, and $C_j$ is the centroid for cluster j.
1. Choose K, the number of clusters or groups to divide the data into.
2. Randomly select a centroid for each of these K clusters.
3. Assign data points to their closest cluster centroid according to Euclidean / Squared Euclidean / Manhattan / Cosine distance.
4. Calculate the centroids of the newly formed clusters.
5. Repeat steps 3 and 4 until the same centroids are assigned to each cluster (convergence).
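The K-means procedure (choose K, pick random centroids, assign points, recompute centroids, repeat until the centroids stop changing) can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions: squared Euclidean distance, initial centroids sampled from the data points, and hypothetical names (`kmeans`, `toy_points`, `seed`) that are not from the original text.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def assign(points, centroids):
    """Step 3: index of the closest centroid for every point."""
    return [min(range(len(centroids)),
                key=lambda j: squared_distance(p, centroids[j]))
            for p in points]

def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Step 2: randomly select k data points as the initial centroids.
    centroids = [tuple(p) for p in rng.sample(points, k)]
    for _ in range(max_iter):
        labels = assign(points, centroids)
        # Step 4: recompute each centroid as the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                dim = len(members[0])
                new_centroids.append(tuple(
                    sum(m[d] for m in members) / len(members)
                    for d in range(dim)))
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster's centroid
        # Step 5: stop once the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    labels = assign(points, centroids)
    # Objective J: total squared distance of points to their centroid.
    objective = sum(squared_distance(p, centroids[lab])
                    for p, lab in zip(points, labels))
    return centroids, labels, objective

# Hypothetical toy data: two well-separated groups of three points.
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels, objective = kmeans(toy_points, k=2)
```

On less cleanly separated data the result depends on the random initialization, so running with several seeds and keeping the run with the lowest objective is a common precaution.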
→ K-means always converges (mostly to a local minimum, not to the global minimum)
• How to choose K, the number of clusters, in the K-Means algorithm?
→ The maximum possible number of clusters will be equal to the number of observations in the dataset.

Hierarchical Clustering

Agglomerative method: "Bottom-up"
Procedure:
1. Compute the distance, or proximity, matrix
2. Initialization: Each observation is a cluster
3. Iteration: Merge the two clusters which are most similar, until all observations are merged into a single cluster.
Divisive method: "Top-down"
1. Compute the distance, or proximity, matrix
2. Initialization: All objects stay in one cluster
3. Iteration: Select a cluster and split it into two sub-clusters, until each leaf cluster contains only one observation.
Proximity (distance) matrix
→ Single linkage: Shortest distance between two points, one from each cluster. (Ward linkage, by contrast, merges the pair of clusters that least increases the within-cluster variance.)
$L(C_1, C_2) = \min D\left[X_i^{C_1}, X_j^{C_2}\right]$
→ Complete linkage: Longest distance between two points, one from each cluster. Minimize the maximum distance between cluster pairs
$L(C_1, C_2) = \max D\left[X_i^{C_1}, X_j^{C_2}\right]$
→ Average linkage: Minimize the average distance between cluster pairs
$L(C_1, C_2) = \frac{1}{n_{C_1} n_{C_2}} \sum_{i=1}^{n_{C_1}} \sum_{j=1}^{n_{C_2}} D\left[X_i^{C_1}, X_j^{C_2}\right]$

DBSCAN

→ Two parameters: ε - distance, minimum points
→ Three classifications of points:
• Core: has at least minimum points within ε - distance, including itself
• Border: has fewer than minimum points within ε - distance but can be reached by a cluster
• Outlier: point that cannot be reached by any cluster
Procedure:
1. Pick a random point that has not been assigned to a cluster or designated as an Outlier. Determine if it is a Core Point. If not, label the point as Outlier.
2. Once a Core Point has been found, add all directly reachable points to its cluster. Then do neighbor jumps to each reachable point and add them to the cluster. If an Outlier has been added, label it as a Border Point.
3. Repeat these steps until all points are assigned to a cluster or labeled as Outlier.

Principal Component Analysis (PCA)

PCA combines highly correlated variables into a new, smaller set of constructs called principal components, which capture most of the variance present in the data.
• Dimensionality reduction
• Feature extraction
• Data visualization
1. Standardize the data: $Z = \frac{X - \text{mean}}{SD}$
2. Calculate the covariance matrix of the standardized data
$V = cov(Z^T)$
3. Find eigenvalues and eigenvectors from the covariance matrix
values, vectors = eig(V)
4. Feature vectors: This is simply the matrix whose columns are the eigenvectors of the components that we decide to keep.
5. Project data → $Z_{new} = vectors^T \cdot Z^T$

Association Rule Mining

"Market Basket Analysis" → It uses Machine Learning models to analyze data for patterns, or co-occurrence, in a database.

Graphical Modelling and Network Analysis

"Bayesian Networks"

Neural Network

A neural network is a type of machine learning model that mimics the structure and function of the human brain to recognize patterns, make decisions, and learn from data.
→ Input Layer: The first layer that receives the input data. Each neuron in this layer corresponds to a feature of the input data.
→ Hidden Layers: Layers between the input and output layers where the network learns complex patterns.
→ Output Layer: The final layer that produces the network's output, such as a prediction or classification.
• Perceptron - the foundation of a neural network; it is a single-layer neural network. An Artificial Neuron is a basic building block of a neural network.
• Neural Network - a multi-layer perceptron
→ Weights: the real values attached to each input/feature; they convey the importance of that feature in predicting the final output.
→ Bias: used for shifting the activation function towards left or right.
→ Summation Function: used to bind the weights and inputs together and calculate their sum.
→ Activation Function: decides whether a neuron should be activated or not, and introduces non-linearities into the network, which makes the network capable of learning and performing more complex tasks.
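The components of an artificial neuron described in the Neural Network section (weights, bias, summation function, activation function) can be put together in a short sketch. The names below and the AND-gate weights are assumptions chosen for illustration, not values from the original text, and no training is performed.

```python
import math

def weighted_sum(inputs, weights, bias):
    """Summation function: sum of weight * input, shifted by the bias."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias

def step(z):
    """Step activation: the neuron fires (1) when z is non-negative."""
    return 1 if z >= 0 else 0

def sigmoid(z):
    """Sigmoid activation: a smooth, non-linear squashing into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias, activation=step):
    """One artificial neuron: the activation applied to the weighted sum."""
    return activation(weighted_sum(inputs, weights, bias))

# Hypothetical hand-picked parameters that make a single neuron act as a logical AND.
and_weights, and_bias = [1.0, 1.0], -1.5
and_outputs = [neuron([a, b], and_weights, and_bias)
               for a in (0, 1) for b in (0, 1)]

# The same neuron with a sigmoid activation returns a graded value instead of 0/1.
soft_output = neuron([1, 1], and_weights, and_bias, activation=sigmoid)
```

Changing the bias moves the firing threshold: with these weights, a bias of -0.5 would turn the same neuron into a logical OR.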
Evolution of NLP