Machine Learning Cheatsheet Compiled and Curated by Robins Yadav

Hello

Uploaded by

dopemig197
Copyright
© All Rights Reserved

Machine Learning Cheatsheet
© 2024 Robins Yadav - magic starts here
My github: https://github.com/robinyUArizona

Machine Learning General

Definition
We want to learn a target function f that maps input variables X to output variable y, with an error e:
$$y = f(X) + e$$

Linear, Non-linear
Different algorithms make different assumptions about the shape and structure of f. Any algorithm can be either:
• Parametric (or Linear): simplifies the mapping to a known linear combination form and learns its coefficients.
• Non-parametric (or Non-linear): free to learn any functional form from the training data, while maintaining some ability to generalize.
Note: Linear algorithms are usually simpler, faster and require less data, while nonlinear algorithms can be more flexible, more powerful and more performant.

Supervised, Unsupervised
• Supervised learning methods learn to predict outcomes y (y^(1), ..., y^(m)) from data points X (x^(1), ..., x^(m)) given that the data is labeled.
• Unsupervised learning methods learn to find the inherent structure or hidden patterns from unlabeled data X (x^(1), ..., x^(m)).

Type of prediction
         | Regression        | Classification
Outcome  | Continuous        | Class
Examples | Linear Regression | Logistic Regression, SVM, Naive Bayes

→ Conditional Estimates
Regression → conditional expectation: E[y|X = x]
Classification → conditional probability: P(Y = y|X = x)

Type of models
→ Discriminative Model: It focuses on predicting the labels of the data. A discriminative model is trained by learning parameters that maximize the conditional probability P(Y|X).
→ Generative Model: It focuses on learning a probability distribution for the dataset; it can reference this probability distribution to generate new data instances. A generative model learns parameters by maximizing the joint probability P(X, Y).

Discriminative model | Generative model
Learns the decision boundary between classes | Learns the input distribution
Directly estimates P(y|x) | Estimates P(x|y), then finds P(y|x) using Bayes' rule
Specifically meant for classification tasks | Used for generating new content or data
Logistic Regression, Random Forests, SVM, Neural Networks, Decision Tree, kNN | Hidden Markov Models, Naive Bayes, Gaussian Mixture Models, Gaussian Discriminant Analysis, LDA, Bayesian Networks

Bias-Variance trade-off, Underfitting, Overfitting
→ Expected test error is the error we expect from predicting new, unobserved data points.
In supervised learning, the prediction (expected) error e is composed of the bias, the variance and the irreducible part.
$$\mathrm{Error}(x) = \left(E[\hat{f}(x)] - f(x)\right)^2 + E\left[\left(\hat{f}(x) - E[\hat{f}(x)]\right)^2\right] + \sigma_e^2$$
→ Bias refers to erroneous assumptions made by the model about the data to make the target function easier to learn. Mathematically, it is how much the predicted values differ from the true values.
→ Variance is the amount by which the prediction (estimate of the target function) would change if different training data sets were used. It measures how scattered (inconsistent) the predicted values are around the correct value across different training data sets (or possibly different random seeds). It is also known as Variance Error or Error due to Variance.
Note: As the complexity of the model rises, the variance will increase and bias will decrease.
• The goal of parameterization is to achieve a low bias and low variance trade-off through methods such as:
→ Cross-validation can be used to tune models so as to optimize the trade-off
→ Dimension reduction and feature selection
→ Mixture models (probabilistic models) and ensemble learning.
• Underfitting or High bias means that the model is not able to capture or learn the trend or pattern in data.
• Overfitting or High variance means that the model fits the available data but does not generalize well to predict on new data.

• Training loss vs. Validation loss:
→ The training loss goes down over time, achieving low error values.
→ The validation loss goes down until a turning point is found, and then it starts going up again. That point represents the beginning of overfitting. Therefore, the training process should be stopped when the validation error trend changes from descending to ascending.
→ Epochs: One epoch is when the ENTIRE dataset is passed forward and backward through the neural network only ONCE.
→ Batch: You can't pass the entire dataset into the neural net at once, so you divide the dataset into a number of batches (sets or parts).
→ Iterations: the number of batches needed to complete one epoch.
Question: If the total number of samples in a dataset is 1000 and the batch size is 10, how many iterations will there be in one epoch? Ans: 100
• How would you identify if your model is overfitting? By analyzing the learning curves, you should be able to spot whether the model is underfitting or overfitting. The y-axis is some metric of learning (e.g., classification accuracy) and the x-axis is experience (time or number of iterations).
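The stopping rule described above (stop when the validation loss turns from descending to ascending) can be sketched in plain Python. This is only an illustration: the function name `early_stopping_epoch`, the `patience` parameter and the loss values are assumptions, not part of the original notes.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the index of the epoch with the best validation loss.

    The scan stops once the validation loss has failed to improve
    for `patience` consecutive epochs (hypothetical parameter).
    """
    best_epoch = 0
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # Validation loss is still descending: remember this epoch.
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            # Validation loss went up (or stalled): possible overfitting.
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return best_epoch


# Made-up validation losses: they fall until epoch 3, then rise again.
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.58, 0.65]
stop_at = early_stopping_epoch(val_losses)
print(stop_at)
```

Using a small patience instead of stopping at the first increase is a design choice that tolerates noisy validation curves.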
• Regularization - Dropout: During training, randomly set some activations to 0. This forces the network not to rely on any one node.
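A minimal sketch of dropout on a list of activations, in plain Python. Rescaling the surviving activations by 1 / (1 - drop_prob) ("inverted" dropout) is a common convention assumed here; the notes above only state that activations are randomly zeroed. The function and variable names are illustrative.

```python
import random


def dropout(activations, drop_prob, rng):
    """Zero each activation with probability drop_prob.

    Survivors are scaled by 1 / (1 - drop_prob) so the expected
    activation stays unchanged (an assumption, see the lead-in).
    """
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]


rng = random.Random(0)          # fixed seed so the sketch is repeatable
layer_output = [1.0] * 10       # hypothetical activations of one layer
dropped = dropout(layer_output, drop_prob=0.5, rng=rng)
print(dropped)
```

At prediction time dropout is switched off, which corresponds to calling the function with `drop_prob=0.0`.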
Unrepresentative Training Dataset
Occurs when the data available during training is not enough to capture the model, relative to the validation dataset.
The train and validation curves are improving, but there's a big gap between them, which means they operate like datasets from different distributions.

Unrepresentative Validation Dataset
Here, the training curve looks ok, but the validation curve moves noisily around the training curve. It could be the case that validation data is scarce and not very representative of the training data, so the model struggles to model these examples.
Here, the validation loss is much better than the training one, which reflects that the validation dataset is easier to predict than the training dataset. An explanation could be that the validation data is scarce but widely represented by the training dataset, so the model performs extremely well on these few examples.

• What is cross-validation? Why is it important?
Cross-validation evaluates a model's performance.
→ The idea is to divide the dataset into k subsets or "folds", train the model on k − 1 of these folds, and test on the remaining fold to ensure that the model generalizes well to unseen data.
→ After evaluating on all k folds, performance metrics are averaged for a robust estimate of the model's effectiveness.
→ K-Fold Cross-Validation: The data is divided into k equally-sized folds.
→ Stratified K-Fold: Similar to K-Fold, but it maintains the proportion of classes in each fold, making it ideal for imbalanced datasets.
→ Leave-One-Out Cross-Validation (LOOCV): A special case where k equals the number of data points, so each fold contains just one data point.

Data Science General

Notes
→ Parameters: Model parameters are numeric values that are part of the model. We estimate them from the data.
→ Ablation: An ablation study is turning off components of a model (e.g. features or sub-models) one at a time, to see how much each contributes to the model's performance.
→ Overfitting arises when our model learns too much, so it can't generalize to new data.
→ Apply transformations (e.g., log, square root) to reduce skewness and improve model performance, especially for models sensitive to outliers.
→ Apply transformations based on domain knowledge and understanding of the data characteristics.

Preventing Data Leakage
• Proper Data Splitting: Split data into training, validation, and test sets before performing any data preprocessing. For time series, split data chronologically.
• Transformation fit on Training Data Only: Ensure that transformations (e.g., scaling, encoding) are fit only on the training data and then applied to both training and test data.
• Feature Selection: Ensure that features used for training are available at the time of prediction. For time series data, create lag features that use only past information.
• Target Leakage: Ensure that features used in training are available at the time of prediction and do not contain information derived from the target variable.
• Data Augmentation: Apply augmentation techniques only to the training set.

Handle missing or corrupted data in a dataset
• Remove Missing Data: drop rows if only a small number of rows have missing values, or drop a column if a large percentage of its values are missing.
• Impute Missing Data: mean/median/mode imputation, imputation using algorithms, forward/backward fill or interpolation for time series data.
• Use Algorithms That Handle Missing Data: decision trees, random forests.
• Replace Corrupted Data: identify outliers using statistical methods (like z-scores or IQR); apply manual inspection and correction of corrupted values based on external sources or domain expertise.
• Data Augmentation: create synthetic data based on existing patterns.

Inference vs. Classification
→ Inference: Given two groups, what are the differences between these groups? t-tests, paired t-tests, etc.
→ Classification: Given a new animal, is it a cat or a dog?

Prediction and Inference
→ Prediction uses a model to predict future observations.
• Model does not need to be valid
• Evaluation does need to be valid
• Quality & strength: accuracy of predicting unseen data
→ Inference uses the model's structure and parameters to learn or understand an underlying phenomenon.
• Validity depends on assumptions
• Quality & strength: R², p-value, assumption checks, coefficients

ETL - Extract Transform Load
An ETL workflow is essential for consolidating and preparing data for analysis, reporting, and business intelligence. Organizations can ensure data accuracy, consistency, and availability by following a well-defined ETL process, enabling better decision-making and insights.
1. Extract Phase:
Goal: Retrieve raw data from various source systems.
• Identify Data Sources: Determine the databases, files, APIs, or other sources from which data will be extracted.
• Connect to Data Sources: Establish connections to these sources using appropriate connectors or drivers.
• Extract Data: Perform the actual data extraction, which may involve running SQL queries, reading files, or calling API endpoints.
• Handle Incremental Extraction: Extract only the new or updated data since the last extraction to optimize performance and reduce load.
Tools: SQL queries, Python scripts, database connectors, API clients.
2. Transform Phase:
Goal: Clean, format, and prepare the data for loading into the target system.
• Data Cleaning: Remove duplicates, handle missing values, correct errors and inconsistencies.
• Data Integration: Merge data from different sources, resolve data conflicts and redundancies.
• Data Transformation: Convert data types, normalize or denormalize data, apply business rules and calculations, aggregate data (e.g., summing, averaging).
• Data Enrichment: Add additional context or metadata, join with reference data.
• Data Validation: Check for data quality and integrity, ensure transformations have been applied correctly.
Tools: Python, SQL, transformation tools like Apache NiFi, Talend, or Apache Spark.
3. Load Phase:
Goal: Load the transformed data into the target data warehouse or database.
• Prepare Target Schema: Define the schema of the target database or data warehouse, create tables, indexes, and partitions as needed.
• Load Data: Insert transformed data into the target tables, use bulk load methods for efficiency when dealing with large datasets.
• Post-Load Validation: Verify that the data has been loaded correctly, check for any discrepancies or errors.
• Indexing and Partitioning: Optimize the target database for query performance, create necessary indexes and partitions.
Tools: Database clients, bulk load utilities, ETL tools with load functionality.

ETL Workflow Automation
Automation Tools:
→ Apache Airflow: For workflow scheduling and orchestration.
→ AWS Glue: Managed ETL service on AWS.
→ Apache NiFi: For data flow automation.

Model Evaluation

Classification Problems

Confusion Matrix
• The data gives us outcomes ("truth") (y|Y)
• The model makes decisions (d|D) (saving ŷ scores)
Then, we compare decisions (d) to outcomes (y)

Type I error: The null hypothesis H0 is rejected when it is true.
Type II error: The null hypothesis H0 is not rejected when it is false.
→ False positive (Type I error): incorrectly decide yes
→ False negative (Type II error): incorrectly decide no
Ex: We assume the null hypothesis H0 is true.
→ H0: Person is not guilty
→ H1: Person is guilty

(1) Accuracy: $\frac{TP + TN}{TP + TN + FP + FN}$ → Ratio of correct predictions over total predictions. Estimate of P[D = Y], the probability that the decision is equal to the outcome.

(2) Recall or Sensitivity or True positive rate: $\frac{TP}{TP + FN}$. Completeness of model. → Out of total actual positive (1) values, how often the classifier is correct. Probability: P[D = 1|Y = 1]
Example: "Fraudulent transaction detector" or "Person Cancer" → +ve (1) is "fraud": Optimize for sensitivity because false positives (FP: normal transactions that are flagged as possible fraud) are more acceptable than false negatives (FN: fraudulent transactions that are not detected).

(3) Precision: $\frac{TP}{TP + FP}$. Exactness of model. → Out of total predicted positive (1) values, how often the classifier is correct. Probability: P[Y = 1|D = 1]. If our model says positive, how likely is it to be correct in that judgement?
Example: "Spam Filter": +ve (1) class is spam → Optimize for precision or specificity because false negatives (FN: spam goes to the inbox) are more acceptable than false positives (FP: non-spam is caught by the spam filter).
Example: "Hotel booking cancelled": +ve (1) class is isCancelled → Optimize for precision or specificity because false negatives (FN: isCancelled labeled as "not cancelled" 0) are more acceptable than false positives (FP: isnotCancelled labeled as "cancelled" 1).

(4) F1-Score = $2 \times \frac{Precision \times Recall}{Precision + Recall}$ → False positives (FP) and false negatives (FN) are equally important.

(5) False Positive Rate: $\frac{FP}{TN + FP}$. Fraction of negatives wrongly classified positive. Probability: P[D = 1|Y = 0]

(6) False Negative Rate: $\frac{FN}{TP + FN}$ = 1 − Recall. Fraction of positives wrongly classified negative. Probability: P[D = 0|Y = 1]

(7) Specificity: $\frac{TN}{TN + FP}$ = 1 − FPR. Fraction of negatives rightly classified negative. Probability: P[D = 0|Y = 0]

(8) "Fraudulent transaction detector", FPR = $\frac{FP}{FP + TN}$ → probability of falsely rejecting the "Null Hypothesis" H0

(9) ROC-curve: What FPR must you tolerate for a certain TPR? An ROC curve plots TPR vs. FPR at different classification thresholds α.
→ Lowering the classification threshold classifies more items as positive, thus increasing both False Positives and True Positives.
Note: We can think of the plot as the fraction of correct predictions for the positive class (y-axis) versus the fraction of errors for the negative class (x-axis).

(10) AUC: Area Under the ROC Curve. The points of an ROC curve can be computed with an efficient, sorting-based algorithm, and AUC summarizes the curve in a single number. AUC ranges in value from 0 to 1. It measures how well the model differentiates positives from negatives (perfect AUC = 1, baseline = 0.5).

How to choose the threshold for logistic regression? The choice of a threshold depends on the relative importance of TPR and FPR for the classification problem. For example: Suppose you are building a model to predict customer churn. False negatives (not identifying customers who will churn) might lead to loss of revenue, making TPR crucial. In contrast, falsely predicting churn (false positives) could lead to unnecessary retention efforts, making FPR important. If there is no external concern about low TPR or high FPR, one option is to weight them equally by choosing the threshold that maximizes TPR − FPR.

Precision-Recall curve: Focuses on the correct prediction of the minority class, useful when data is imbalanced. Plots precision against recall at different thresholds.
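The confusion-matrix metrics above, and the rule of picking the threshold that maximizes TPR − FPR, can be sketched in plain Python. The labels, scores and function names below are made up for illustration, and the helper assumes that both classes are present in the data.

```python
def confusion_counts(y_true, scores, threshold):
    """Return (TP, FP, TN, FN), deciding 1 when score >= threshold."""
    tp = fp = tn = fn = 0
    for y, s in zip(y_true, scores):
        decision = 1 if s >= threshold else 0
        if decision == 1 and y == 1:
            tp += 1
        elif decision == 1 and y == 0:
            fp += 1
        elif decision == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def summary(y_true, scores, threshold):
    """Accuracy, precision, recall (TPR), F1 and FPR at one threshold."""
    tp, fp, tn, fn = confusion_counts(y_true, scores, threshold)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "fpr": fp / (fp + tn),
    }


def best_threshold(y_true, scores):
    """Observed score that maximizes TPR - FPR (both classes must exist)."""
    def tpr_minus_fpr(t):
        tp, fp, tn, fn = confusion_counts(y_true, scores, t)
        return tp / (tp + fn) - fp / (fp + tn)
    return max(sorted(set(scores)), key=tpr_minus_fpr)


# Hypothetical outcomes and model scores.
y_true = [0, 0, 0, 1, 1, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.9]

print(summary(y_true, scores, threshold=0.5))
print(best_threshold(y_true, scores))
```

Scanning only the observed scores is enough here, because the confusion counts change only when the threshold crosses one of them.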
Regression Problems
1. Mean Squared Error: $MSE = \frac{1}{n} \sum_i (y_i - \hat{y}_i)^2$
2. Root Mean Squared Error: $RMSE = \sqrt{\frac{\sum_{i=1}^{N} (\hat{y}_i - y_i)^2}{N}}$
3. Mean Absolute Error: $MAE = \frac{1}{n} \sum_i |y_i - \hat{y}_i|$
4. Sum of Squared Error: $SSE = \sum_i (y_i - \hat{y}_i)^2$
5. Total Sum of Squares: $SST = \sum_i (y_i - \bar{y})^2$
6. R² Error: $R^2 = 1 - \frac{MSE(model)}{MSE(baseline)}$, equivalently $R^2 = 1 - \frac{SSE}{SST}$
7. Adjusted R²: $R_a^2 = 1 - \frac{n - 1}{n - k - 1} (1 - R^2)$

Variance, R² and the Sum of Squares
• The total sum of squares: $SS_{total} = \sum_i (y_i - \bar{y})^2$
• This scales with variance: $var(Y) = \frac{1}{n} \sum_i (y_i - \bar{y})^2$
• The regression sum of squares: $SS_{reg} = \sum_i (\hat{y}_i - \bar{y})^2$ → n Var(predictions)
• The residual sum of squares (squared error): $SS_{resid} = \sum_i (y_i - \hat{y}_i)^2$ → n Var(ϵ)
Note: ϵ̄ = 0, E[ŷ] = ȳ
$$SS_{total} = SS_{reg} + SS_{resid}$$
$$R^2 = 1 - \frac{SS_{resid}}{SS_{total}} = \frac{SS_{reg}}{SS_{total}} = \frac{n\,Var(Preds)}{n\,Var(Y)} = \frac{Var(Preds)}{Var(Y)}$$
→ Explained Variance: R² quantifies how much of the variability in the outcome (dependent) variable can be explained by the predictor (independent) variables.
→ An R² of 1 indicates a perfect fit, where the model explains all the variability, while an R² of 0 indicates that the model explains none of the variability.
→ Goodness of Fit: A higher R² value generally suggests a better fit of the model to the data, meaning the model's predictions are closer to the actual values.
→ R² is not valid for nonlinear models as $SS_{reg} + SS_{resid} \neq SS_{total}$
→ Drawback: R-squared never decreases when a new predictor variable is added to the regression model.

Optimization
Almost every machine learning method has an optimization algorithm at its core.
→ Hypothesis: The hypothesis is noted $h_\theta$ and is the model that we choose. For a given input data $x^{(i)}$ the model prediction output is $h_\theta(x^{(i)})$.
→ Loss function: A loss function is a function $L: (\hat{y}, y) \in \mathbb{R} \times Y \mapsto L(\hat{y}, y) \in \mathbb{R}$ that takes as inputs the predicted value ŷ corresponding to the real data value y and outputs how different they are. The loss function is the function that computes the distance or difference between the current output ŷ of the algorithm and the expected output y. The common loss functions are summed up in the table below:
Loss                | Formula              | Used by
Least squared error | (1/2)(y − ŷ)²        | Linear Regression
Logistic loss       | log(1 + exp(−yŷ))    | Logistic Regression
Hinge loss          | max(0, 1 − yŷ)       | SVM
→ Cost function: The cost function J is commonly used to know the performance of a model, and is defined with the loss function L as follows:
$$J(\theta) = \sum_{i=1}^{m} L(h_\theta(x^{(i)}), y^{(i)})$$

Convex & Non-convex
A convex function is one where a line drawn between any two points on the graph lies on or above the graph. It has one minimum. A non-convex function is one where a line drawn between any two points on the graph may intersect other points on the graph. It is characterized as "wavy".
→ When a cost function is non-convex, there is a likelihood that the optimizer finds a local minimum instead of the global minimum, which is typically undesired in machine learning models from an optimization perspective.

Gradient Descent
Gradient Descent is used to find the coefficients of f that minimize a cost function (for example MSE, SSR).
→ Time Complexity: O(kn²) → n is no. of data points.
→ It minimizes the average loss by moving iteratively in the direction of steepest descent, controlled by the learning rate γ (step size, written α in the update rule below). Note, γ can be updated adaptively for better performance.
Procedure:
1. Initialization θ = 0 (coefficients to 0 or random)
2. Calculate cost J(θ) = evaluate f(coefficients)
3. Gradient of cost $\frac{\partial}{\partial \theta_j} J(\theta)$: we know the uphill direction
4. Update coeff $\theta_j = \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$: we go downhill
The cost updating process is repeated until convergence (minimum found).

Tips:
• Change learning rate α ("size of jump" at each iteration)
• Plot Cost vs. Time to assess learning rate performance.
• Rescale the input variables
• Reduce passes through training set with SGD
• Average over 10 or more updates to observe the learning trend while using SGD
• Stochastic Gradient Descent applies the parameter update for each observation.
→ Time Complexity: O(km²) → m is the sample of data selected randomly from the entire data of size n
→ It only uses a single point to compute gradients, leading to noisier convergence but faster compute speeds.
• Mini-batch Gradient Descent trains on small subsets of the data, striking a balance between the approaches.

Ordinary Least Squares

Least Squares Regression
We fit linear models:
$$\hat{y} = \beta_0 + \sum_j \beta_j x_j$$
Here, $\beta_j$ is the j-th coefficient and $x_j$ is the j-th feature.
Ordinary Least Squares - find $\vec{\beta}$ that minimizes squared error:
$$\arg\min_{\vec{\beta}} \sum_i (y_i - \hat{y}_i)^2$$
Goal: least-squares solution to $X\vec{\beta} = y$
Solution: solve the normal equations:
$$X^T X \vec{\beta} = X^T y \;\rightarrow\; \vec{\beta} = (X^T X)^{-1} X^T y$$
$L(\vec{\beta} \mid X, \hat{y})$

General Optimization:
1. Understand data (features and outcome variables)
2. Define loss (or gain/utility) function
3. Define predictive model
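The gradient descent procedure above (initialize, compute cost, compute gradient, update, repeat) can be sketched for a one-feature least-squares model in plain Python. The data, the learning rate and the fixed iteration count are assumptions chosen so that the example converges; a real stopping criterion would usually check the change in cost.

```python
def cost(xs, ys, b0, b1):
    """Mean squared error J for the line y = b0 + b1 * x."""
    return sum((b0 + b1 * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def gradient_descent(xs, ys, alpha=0.05, n_iters=2000):
    """Batch gradient descent on the MSE cost; returns (b0, b1, cost history)."""
    b0, b1 = 0.0, 0.0                               # 1. initialization
    n = len(xs)
    history = []
    for _ in range(n_iters):
        history.append(cost(xs, ys, b0, b1))        # 2. calculate cost
        errors = [b0 + b1 * x - y for x, y in zip(xs, ys)]
        grad_b0 = 2.0 * sum(errors) / n             # 3. gradient of cost
        grad_b1 = 2.0 * sum(e * x for e, x in zip(errors, xs)) / n
        b0 -= alpha * grad_b0                       # 4. update: go downhill
        b1 -= alpha * grad_b1
    return b0, b1, history


# Made-up data generated from y = 1 + 2x, so the minimum is known.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
b0, b1, history = gradient_descent(xs, ys)
print(round(b0, 4), round(b1, 4))
```

With a much larger learning rate the updates would overshoot and the cost would grow instead of shrinking, which is why plotting the cost history is a useful check.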
4. Search for parameters that minimize loss function

Augmented Loss
We can add more things to the loss function
• Penalize model complexity
• Penalize "strong" beliefs
– Requires predictive utility to overcome them
→ Least squares generalizes into minimizing loss functions.
→ This is the heart of machine learning, particularly supervised learning.

Maximum Likelihood Estimation MLE
Any probability distribution has parameters, so fitting parameters is an extremely crucial part of data analysis. There are two general methods for doing so. In maximum likelihood estimation (MLE), the goal is to estimate the most likely parameters given a likelihood function:
$$\theta_{MLE} = \arg\max L(\theta), \text{ where } L(\theta) = f_n(x_1, \ldots, x_n \mid \theta)$$
Since the values of X are assumed to be i.i.d., the likelihood function becomes the following:
$$L(\theta) = \prod_{i=1}^{n} f(x_i \mid \theta)$$
The natural log of L(θ) is then taken prior to calculating the maximum; since log is a monotonically increasing function, maximizing the log-likelihood log L(θ) is equivalent to maximizing the likelihood:
$$\log L(\theta) = \sum_{i=1}^{n} \log f(x_i \mid \theta)$$
→ MLE is used to find the estimators that maximize the likelihood function: $L(\theta \mid x) = f_\theta(x)$, the density function of the data distribution.

Maximum a Posteriori MAP
Another way of fitting parameters is through maximum a posteriori estimation (MAP), which assumes a "prior distribution":
$$\theta_{MAP} = \arg\max g(\theta) f(x_1, \ldots, x_n \mid \theta)$$
where the similar log-likelihood is again employed, and g(θ) is a density function of θ.

Log Likelihood
Logistic Regression:
$$P(Y = 1 \mid X = x) = \hat{y} = \mathrm{logistic}\left(\beta_0 + \sum_j \beta_j x_j\right)$$
The model computes probability of yes.

Probability of Observed
What if we want $P(Y = y_i)$, regardless of whether $y_i$ is 1 or 0?
$$P(Y = y_i \mid X = x_i) = \hat{y}_i^{y_i} (1 - \hat{y}_i)^{1 - y_i}$$
• $\hat{y}_i$ is model's estimate of $P(Y = 1 \mid X = x_i)$
• $y_i \in \{0, 1\}$ is outcome
• $\hat{y}_i^{y_i}$ is $\hat{y}_i$ if $y_i = 1$, and 1 if $y_i = 0$ - multiplicative if

Conditioning on Parameters
Fuller definition - condition on parameters $\vec{\beta}$ and write function:
$$P(Y = 1 \mid x, \vec{\beta}) = \hat{y} = m(x, \vec{\beta}) = \mathrm{logistic}(\ldots)$$

Likelihood Function
Given data y = ⟨y1, ..., yn⟩, x = ⟨x1, ..., xn⟩ and parameters $\vec{\beta}$:
$$\mathrm{Likelihood}(y, x, \vec{\beta}) = P(y, x \mid \vec{\beta}) \propto P(y \mid x, \vec{\beta}) = \prod_i P(y_i \mid x_i, \vec{\beta})$$
This is weird: $P(y, x \mid \vec{\beta}) \propto P(y \mid x, \vec{\beta})$
$$P(y, x \mid \vec{\beta}) = P(y \mid x, \vec{\beta}) P(x \mid \vec{\beta})$$
But x is independent of params, so $P(x \mid \vec{\beta}) = P(x)$. And x is fixed, so P(x) is an (unknown) constant.
$$\mathrm{LogLik}(y, x, \vec{\beta}) = \log \mathrm{Likelihood}(y, x, \vec{\beta}) = \log P(x) \prod_i P(y_i \mid x_i, \vec{\beta}) = \log P(x) + \sum_i \log P(y_i \mid x_i, \vec{\beta})$$

Maximum Likelihood Estimator
$$\arg\max_{\vec{\beta}} \sum_i \log P(y_i \mid x_i, \vec{\beta})$$
$$P(Y = y_i \mid X = x_i) = \hat{y}_i^{y_i} (1 - \hat{y}_i)^{1 - y_i}$$
$$\log P(Y = y_i \mid X = x_i) = y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i)$$
Model log likelihood is sum over training data. Applicable to any model where ŷ = P(Y = 1|x)

Linear Algorithms

Regression
→ Regression predicts (or estimates) a continuous variable
Dependent variable Y, Independent variable(s) X → compute estimate ŷ ≈ y
$$\hat{y}_i = \beta_0 + \beta_1 x_i$$
$$y_i = \hat{y}_i + \epsilon_i$$
Here, $\beta_0$ is intercept, $\beta_1$ is slope and ϵ is residuals. The goal is to learn $\beta_0, \beta_1$ to minimize $\sum \epsilon_i^2$ (least squares)
Linearity: A linear equation of k + 1 variables is of the form:
$$\hat{y} = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$$
It is the sum of scalar multiples of the individual variables - a line!
→ Linear models are remarkably capable of transforming many non-linear problems into linear.

Linear Regression
$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \epsilon_i$$
$$\hat{y}_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij}, \quad i = 1, \ldots, n$$
Here, n is total no. of observations, $y_i$ is the dependent variable, $x_{ij}$ is the explanatory variable (j-th feature of the i-th observation). $\beta_0$ is the intercept, usually called the bias coefficient.
Assumptions:
→ Linear models make four key assumptions necessary for inferential validity.
• Linearity: outcome y and predictor X have a linear relationship.
• Independence: observations are independent of each other
- Independent variables (features) are not highly correlated with each other → Low multicollinearity
• Normal errors: residuals are normally distributed - check with Q-Q plots. Violation means the line (in Q-Q plots) still fits but p-values and CIs are unreliable
• Equal variance: residuals have constant variance (called homoskedasticity; violation is heteroskedasticity) - check scatterplot or regplot of residuals vs. fitted. Violation means the model is failing to capture a systematic effect.
→ These violations are a problem only for inference, not for prediction.
Variance Inflation Factor: Measures the severity of multicollinearity → $\frac{1}{1 - R_i^2}$, where $R_i^2$ is found by regressing $X_i$ against all other variables (a common VIF cutoff is 10)
Learning: Estimating the coefficients β from the training data using the optimization algorithm Gradient Descent or Ordinary Least Squares.
Ordinary Least Squares - where we find $\vec{\beta}$ that minimizes squared error:
$$\arg\min_{\vec{\beta}} \sum_i (y_i - \hat{y}_i)^2$$
→ The dimension of the hyperplane of the regression is its complexity.
Variations: There are extensions of Linear Regression training called regularization methods, that aim to reduce the complexity of the
models or to address over-fitting in ML. The regularizer is not dependent on the data.
→ In relation to the bias-variance trade-off, regularization aims to decrease complexity in a way that significantly reduces variance while only slightly increasing bias.
→ Standardize numeric variables when using regularization to ensure that 0 is a neutral value, so a low coefficient means "little effect when deviating from average". Values, and therefore coefficients, are then on the same scale (# of standard deviations), which properly distributes weight between them.
→ Multicollinearity → correlated predictors. Problem: Which coefficient gets the common effect? To solve: loss and regularization come in.
• Ridge Regression (L2 regularization): where OLS is modified to also minimize the squared sum of the coefficients
$$\sum_{i=1}^{n} \left(y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij}\right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 = RSS + \lambda \sum_{j=1}^{p} \beta_j^2$$
→ Prevents the weights from getting too large (L2 norm). If lambda is very large then it will add too much weight and it will lead to under-fit.
$$\lambda \propto \frac{1}{\text{model variance}}$$
• Lasso Regression (L1 regularization): where OLS is modified to also minimize the sum of the absolute values of the coefficients
$$\sum_{i=1}^{n} \left(y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij}\right)^2 + \lambda \sum_{j=1}^{p} |\beta_j| = RSS + \lambda \sum_{j=1}^{p} |\beta_j|$$
where p is the no. of features (or dimensions), λ ≥ 0 is a tuning parameter to be determined.
→ Lasso shrinks the less important features' coefficients to zero, thus removing some features altogether. If lambda is very large, it will make coefficients zero and hence under-fit.
→ L2 is less likely to shrink coefficients to 0. Therefore L1 regularization leads to sparser models.
Data preparation:
- Transform data for linear relationship (ex: log transform for exponential relationship)
- Remove noise such as outliers
- Rescale inputs using standardization or normalization
Advantages:
+ Good regression baseline considering simplicity
+ Lasso/Ridge can be used to avoid overfitting
+ Lasso/Ridge permit feature selection in case of collinearity
Usecase examples:
- Product sales prediction according to prices or promotions
- Call-center waiting-time prediction according to the number of complaints and the number of working agents

Logistic Regression

Log-Odds and Logistics
• Odds
The probability of success P(S): 0 ≤ p ≤ 1
→ The odds of success are defined as the ratio of the probability of success over the probability of failure.
The odds of success: $Odds(S) = \frac{P(S)}{P(S^c)} = \frac{P(S)}{1 - P(S)}$
→ Ex: Odds(failure) = x → means x:1 against success
• Log Odds or logit →
$$\log Odds(A) = \log \frac{P(A)}{1 - P(A)} = \log P(A) - \log(1 - P(A))$$
• Logistic: The inverse of the logit ($\mathrm{logit}^{-1}$):
$$\mathrm{logistic}(x) = \frac{1}{1 + e^{-x}} = \frac{e^x}{e^x + 1}$$
[Figure: sigmoid or logistic curve]
→ Odds are another way of representing probabilities.
→ The logistic and logit functions convert between probabilities and log-odds.
• Generalized Linear Models (GLMs):
$$\hat{y}_i = g^{-1}(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip})$$
$$\hat{y}_i = g^{-1}\left(\beta_0 + \sum_{j=1}^{p} \beta_j x_{ij}\right)$$
Here, g is a link function
• Counts: Poisson regression, log link func
• Binary: Logistic regression, logit link func and $g^{-1}$ is logistic func
→ In logistic regression, a linear output is converted into a probability between 0 and 1 using the sigmoid or logistic function.
$$P(y_i = 1 \mid X) = \hat{y}_i = \mathrm{logistic}\left(\beta_0 + \sum_j \beta_j x_{ij}\right)$$
The representation below is an equation with binary output, which actually models the probability of default class:
$$p(X) = \frac{e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_i x_i}}{1 + e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_i x_i}} = p(y = 1 \mid X)$$
Note: Coefficients are linearly related to log-odds, such that a one unit increase in $x_1$ multiplies the odds by $e^{\beta_1}$.
Note: The coefficients in logistic regression are interpreted in terms of their effect on the log-odds of the outcome, and the exponentiated coefficients (odds ratios) provide a clearer understanding of the change in odds associated with each predictor.
Assumptions:
- Linear relationship between X and log-odds of Y
- Observations must be independent of each other
- Low multicollinearity
Learning: Learning the logistic regression coefficients is done by:
→ Minimizing the logistic loss function
$$\arg\min_{\vec{\beta}} \sum_i \log\left(1 + \exp(-y_i \vec{\beta} x_i)\right)$$
→ Maximizing the log likelihood of the training data given the model
$$\arg\max_{\vec{\beta}} \sum_i \log P(y_i \mid x_i, \vec{\beta})$$
$$P(Y = y_i \mid X = x_i) = \hat{y}_i^{y_i} (1 - \hat{y}_i)^{1 - y_i}$$
$$\log P(Y = y_i \mid X = x_i) = y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i)$$
Model log likelihood is sum over training data. Applicable to any model where ŷ = P(Y = 1|x)
Data preparation:
- Probability transformation to binary for classification
- Remove noise such as outliers
Advantages:
+ Good classification baseline considering simplicity
+ Possibility to change cutoff for precision/recall tradeoff
+ Robust to noise/overfitting with L1/L2 regularization
+ Probability output can be used for ranking
Usecase examples:
- Customer scoring with probability of purchase
- Classification of loan defaults according to profile

Linear Discriminant Analysis
For multiclass classification, LDA is the preferred linear technique.
Representation: LDA representation consists of statistical properties calculated for each class: means and the covariance matrix:
$$\mu_k = \frac{1}{n_k} \sum_{i=1}^{n} x_i \qquad \sigma^2 = \frac{1}{n - k} \sum_{i=1}^{n} (x_i - \mu_k)^2$$
LDA assumes Gaussian data and attributes of same σ². Predictions are made using Bayes Theorem:
$P(y = k \mid X = x) = \frac{P(k) \times P(x \mid k)}{\sum_{l=1}^{k} P(l) \times P(x \mid l)}$
to obtain a discriminant function (latent variable) for each class k,
estimating P(x|k) with a Gaussian distribution:
$D_k(x) = x \times \frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \ln(P(k))$
The class with largest discriminant value is the output class.
Variations:
1. Quadratic DA: Each class uses its own variance estimate
2. Regularized DA: Regularization into the variance estimate.
Data preparation:
- Review and modify univariate distributions to be Gaussian
- Standardize data to µ = 0, σ = 1 to have same variance
- Remove noise such as outliers
Advantages:
+ Can be used for dimensionality reduction by keeping the latent
variables as new variables
Usecase example:
- Prediction of customer churn

Nonlinear Algorithms
All Nonlinear Algorithms are non-parametric and more flexible. They
are not sensitive to outliers and do not require any shape of
distribution.

Naive Bayes Classifier
Naive Bayes is a classification algorithm interested in selecting the
best hypothesis h given data d assuming that the features of each
data point are all independent.
Representation: The representation is based on Bayes Theorem:
$P(Y \mid d) = \frac{P(d \mid Y) \times P(Y)}{P(d)}$
With naive hypothesis,
$P(d \mid Y) = P(x_1, x_2, \cdots, x_n \mid Y) = P(x_1 \mid Y) \times P(x_2 \mid Y) \times \cdots \times P(x_n \mid Y) = \prod_{i=1}^{n} P(x_i \mid Y)$
The prediction is the maximum a posteriori hypothesis:
$\max(P(Y \mid d)) = \max(P(d \mid Y) \times P(Y))$
here, the denominator is not kept as it is only for normalization.
Learning: Training is fast because only probabilities need to be
calculated:
$P(Y) = \frac{\mathrm{instances}_Y}{\mathrm{all\ instances}} \qquad P(x \mid Y) = \frac{\mathrm{count}(x \wedge Y)}{\mathrm{instances}_Y}$
Variations: Gaussian Naive Bayes can extend to numerical attributes
by assuming a Gaussian distribution. Instead of P(x|h), the mean µ(x)
and standard deviation σ are calculated with P(h) during learning, and
MAP for prediction is calculated using the Gaussian PDF:
$f(x \mid \mu(x), \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}$
$\mu(x) = \frac{1}{n} \sum_{i=1}^{n} x_i \qquad \sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \mu(x))^2}$
Data preparation:
- Change numerical inputs to categorical (binning) or near Gaussian
inputs (remove outliers, log & boxcox transform)
- Other distributions can be used instead of Gaussian
- Log-transform of the probabilities can avoid numerical underflow
- Probabilities can be updated as data becomes available
Advantages:
+ Fast because of the calculations
+ If the naive assumption holds, it can converge quicker than other
models. Can be used on smaller training data.
+ Good for few categories variables
Usecase examples:
- Article classification using binary word presence
- Email spam detection using a similar technique

Likelihood and Posterior
$P(\theta \mid y) = \frac{P(y \mid \theta) P(\theta)}{P(y)}$
• P(θ) is the prior
• P(y|θ) is the likelihood: how likely is the data given params θ
• $P(y) = \int P(y \mid \theta) P(\theta) \, d\theta$ is a scaling factor (constant for fixed y)
• P(θ|y) is the posterior.
• We're maximizing likelihood (ML estimator)
• Can also maximize posterior (MAP estimator)
• When prior is constant, they're the same
• With lots of data, they're almost the same

Support Vector Machines
SVM is a go-to for high performance with little tuning. Compares
extreme values in your dataset.
In SVM, a hyperplane (or decision boundary: $w^T x - b = 0$) is
selected to separate the points in the input variables space by their
class, with the largest margin. The closest datapoints (defining the
margin) are called the support vectors.
→ The goal of a support vector machine is to find the optimal
separating hyperplane which maximizes the margin of the training
data.
→ Kernel: A kernel is a way of computing the dot product of two
vectors x and y in some (possibly very high dimensional) feature
space, which is why kernel functions are sometimes called "generalized
dot product". The kernel trick is a method of using a linear classifier
to solve a non-linear problem by transforming linearly inseparable data
to linearly separable ones in a higher dimension.
Given a feature mapping ϕ, we define the kernel K as follows:
$K(x, z) = \phi(x)^T \phi(z)$
In practice, the kernel K defined by $K(x, z) = e^{-\frac{\|x - z\|^2}{2\sigma^2}}$ is
called the Gaussian kernel and is commonly used.
The prediction function is the signed distance (up to a factor of ∥w∥)
of the new input x to the separating hyperplane defined by w, with b the bias:
$f(x) = \langle w, x \rangle - b = w^T x - b$
→ Optimal margin classifier: The optimal margin classifier h is such
that:
$h(x) = \mathrm{sign}(w^T x - b)$
where $(w, b) \in \mathbb{R}^n \times \mathbb{R}$ is the solution of the following optimization
problem:
$\min \frac{1}{2} \|w\|^2$ such that $y^{(i)} (w^T x^{(i)} - b) \ge 1$
Learning:
→ Hinge loss: The hinge loss is used in the setting of SVMs and is
defined as follows:
$L(\hat{y}, y) = [1 - y\hat{y}]_+ = \max(0, 1 - y\hat{y})$
→ Lagrangian: We define the Lagrangian L(w, b) as follows:
$L(w, b) = f(w) + \sum_{i=1}^{l} \beta_i h_i(w)$
The Lagrange method is required to convert a constrained optimization
problem into an unconstrained optimization problem. The goal of the
equation below is to get the optimal values for w and b:
$\lambda \|\vec{w}\|^2 + \left[ \frac{1}{n} \sum_{i=1}^{n} \max\left(0, 1 - y_i (\vec{w} \cdot \vec{x}_i - b)\right) \right]$
The first term is the regularization term, which is a technique to
avoid overfitting by penalizing large coefficients in the solution vector.
The second term, hinge loss, is to penalize misclassifications. It
measures the error due to misclassification (or data points being
closer to the classification boundary than the margin). The λ is the
regularization coefficient, and its major role is to determine the
trade-off between increasing the margin size and ensuring that the xi
lies on the correct side of the margin.
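As a minimal sketch of the SVM quantities in this section, the snippet below evaluates the Gaussian kernel, the hinge loss, and the regularized objective λ∥w∥² plus the mean hinge loss. The function names, the toy data, and the chosen w, b and λ are hypothetical values for illustration only; the code evaluates the objective and does not train a model.

```python
import math

def gaussian_kernel(x, z, sigma=1.0):
    """K(x, z) = exp(-||x - z||^2 / (2 * sigma^2))."""
    sq_dist = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def hinge_loss(y, score):
    """max(0, 1 - y * score) for a label y in {-1, +1}."""
    return max(0.0, 1.0 - y * score)

def svm_objective(w, b, X, y, lam):
    """lam * ||w||^2 + mean hinge loss, with score = w . x - b."""
    reg = lam * sum(wj * wj for wj in w)
    scores = [sum(wj * xj for wj, xj in zip(w, x)) - b for x in X]
    data_term = sum(hinge_loss(yi, s) for yi, s in zip(y, scores)) / len(X)
    return reg + data_term

# Toy data (hypothetical): two points per class in 2-D.
X = [[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]]
y = [1, 1, -1, -1]
w, b = [0.5, 0.5], 0.0

# Every point lies outside the margin here, so only the
# regularization term contributes to the objective.
objective = svm_objective(w, b, X, y, lam=0.1)
same_point_similarity = gaussian_kernel([1.0, 2.0], [1.0, 2.0])
```

Shrinking w would lower the regularization term but eventually push points inside the margin, which is the trade-off that λ controls.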
Note: we say that we use the "kernel trick" to compute the cost
function using the kernel because we actually don't need to know the
explicit mapping ϕ, which is often very complicated. Instead, only the
values K(x, z) are needed.
Variations:
SVM is implemented using various kernels, which define the measure
between new data and support vectors:
1. Linear (dot-product): $K(x, x_i) = \sum (x \times x_i)$
2. Polynomial: $K(x, x_i) = \left(1 + \sum (x \times x_i)\right)^d$
3. Radial: $K(x, x_i) = e^{-\gamma \sum (x - x_i)^2}$
Data preparation:
- SVM assumes numeric inputs, may require dummy transformation
of categorical features
Advantages:
+ Allow nonlinear separation with nonlinear Kernels
+ Works well in high dimensional space
+ Robust to multicollinearity and overfitting
Usecase examples:
- Face detection from images
- Target Audience Classification from tweets
Hyperparameters: regularization parameter (C) and the kernel
parameters (such as gamma for the RBF kernel).

K-Nearest Neighbors
If you are similar to your neighbors, you are one of them. KNN uses
the entire training data, no training is required.
Note: Higher k → higher the bias, Lower k → higher the variance.
• Choice of k is very critical → A small value of k means that noise
will have a higher influence on the result. → A large value of k makes
everything classified as the most probable class and is also
computationally expensive.
→ A simple approach to select k is to set $k = \sqrt{n}$ or to cross-validate
on a small subset of training data (validation data) by varying values of
k and observing training - validation error.
→ Minkowski Distance = $\left(\sum |a_i - b_i|^p\right)^{1/p}$
- p=1 gives Manhattan distance $\sum |a_i - b_i|$
- p=2 gives Euclidean distance $\sqrt{\sum (a_i - b_i)^2}$
→ Hamming Distance - count of the differences between two vectors,
often used to compare categorical variables.
Time complexity: The distance calculation step requires quadratic
time complexity, and sorting the calculated distances for one point
requires O(N log N) time. Sorting for all N points makes the total
process O(N² log N).
Space complexity: Since all the pairwise distances are stored and
sorted in memory on one machine, memory is also a problem. Usually,
local machines will crash if we have very large datasets.
Data preparation:
- Rescale inputs using standardization or normalization
- Address missing data for distance calculations
- Dimensionality reduction or feature selection for COD (curse of dimensionality)
Advantages:
+ Effective if the training data is large
+ No learning phase
+ Robust to noisy data, no need to filter outliers
Usecase examples:
- Recommending products based on similar customers
- Anomaly detection in customer behavior

Classification and Regression Trees (CART)
Decision Tree is a Supervised learning technique that can be used for
both Classification and Regression problems.
Here, each node represents a question about the data, and the
branches from each node represent the possible answers.
→ Root Node: It is the very first node (parent node), and denotes
the whole population, and gets split into two or more Decision nodes
based on the feature values.
→ Decision Node: At each decision node, the algorithm chooses the
best feature and threshold to split the data, aiming to create the
most homogeneous subsets. They have multiple branches.
→ This process continues until a stopping condition is met (like
maximum depth or pure leaves).
→ Leaf Node: The final predictions are made at the leaf nodes,
which represent the outcome of those decisions.
→ Sub-Tree: A branch is a subdivision of a complete tree.
At each leaf node, CART predicts the most frequent category,
assuming false negative and false positive costs are the same.
→ The splitting process handles multicollinearity and outliers.
→ Trees are prone to high variance, so tune through CV.
Note: In decision trees, the depth of the tree determines the
variance. Decision trees are commonly pruned to control variance.
• CART for regression minimizes SSE by splitting data into
sub-regions and predicting the average value at leaf nodes. The
complexity parameter cp only keeps splits that reduce loss by at least
cp (small cp → deep tree).
• CART for classification minimizes the sum of region impurity,
where pi is the probability of a sample being in category i. Possible
measures (for two classes, Gini impurity has a maximum of 0.5 and
cross entropy a maximum of 1):
- Gini Impurity / Gini Index / Gini Coefficient = $1 - \sum (p_i)^2$
- Cross Entropy = $-\sum (p_i) \log_2 (p_i)$
Procedure:
1. Calculate entropy of the outcome classes (c)
$E(T) = \sum_{i=1}^{c} -p_i \log_2 p_i$
2. The dataset is split on the different attributes. The entropy of
each branch is calculated. Then it is added proportionally to
get total entropy for the split. The resulting entropy is
subtracted from the entropy before the split.
$\mathrm{Gain}(T, X) = \mathrm{Entropy}(T) - \mathrm{Entropy}(T, X)$
3. Choose attributes with largest information Gain as the
decision node, divide the dataset by its branches and repeat
the same process on every branch.
4. A branch with entropy of 0 is a leaf node
5. A branch with entropy more than 0 needs further splitting.
6. ID3 algorithm is run recursively on the non-leaf branches, until
all data is classified
Advantages:
→ Can take any type of variables and do not require any data
preparation
→ Simple to understand, interpret, visualize
→ Non-linear parameters don't affect its performance
Disadvantages:
→ Overfitting (High variance) occurs when the data is noisy
→ DT can be unstable (use bagging or boosting) because of small
variation in data
Hyperparameters: The most common Stopping Criterion for splitting
is a minimum of training observations per node, maximum depth of
the tree

Ensemble Algorithms
Ensemble methods combine multiple, simpler algorithms (weak
learners) to obtain a better performing algorithm.
Bagging: Random Forest
Boosting: AdaBoost, Gradient Boosting, XGBoost
• Bagging
→ It involves parallel training of multiple models independently on
different subsets of the data. These subsets of data are drawn using
the bootstrap technique.
→ Then averaging their predictions (for regression) or majority voting
(for classification).
→ It can reduce the variance and prevent overfitting by averaging
out the errors of individual models.
→ Bootstrapping is drawing random sub-samples (sampling with
replacement) from a large sample (available data) to estimate a
quantity (parameter) of an unknown population by averaging the
estimates from these sub-samples.
Random Forest
→ Bagged Decision Trees: Each DT may contain different no. of
rows and different no. of features.
→ Individual DTs may face overfitting i.e. have low bias (complex
model) but high variance, by ensembling a lot of DTs we are going to
reduce the variance, while not increasing the bias.
Hyperparameters: number of trees, maximum depth of the trees
• Boosting
→ It involves sequentially training of multiple models, where each
model tries to correct the errors of the previous ones.

AdaBoost
• Uses the same training samples at each stage
• "Weakness" = Misclassified data points
• Learning Focus: Primarily reduces bias by focusing on misclassified
instances
Algorithm:
1. Initialize Weights: Assign equal weight to each of the training
data
2. Train weak model and Evaluate: Provide this as input to the
weak model and identify the wrongly classified data points
3. Adjust Weights: Increase the weight of wrongly classified data
points
4. Combined Models: Combine the weak models using a
weighted sum, where weights are based on the accuracy of
each learner.
5. Repeat steps 2-4 for a predefined number of iterations or until
the error is minimized.
• Limitations: Sensitive to noisy data and outliers since misclassified
points are given more focus.
Hyperparameters: number of estimators, learning rate.

Gradient Boosting
• Uses different training samples at each stage
• "Weakness" = Residuals or Errors
• Learning Focus: Instead of adjusting weights, it optimizes the
model by minimizing a loss function (e.g., mean squared error for
regression).
Algorithm:
1. Initialize Model: Start with an initial model (e.g., a constant
value). Let's say Avg.
2. Compute Residuals: Calculate the residuals (errors) of the
current model.
3. Train Weak Learner: Train a weak learner on the residuals.
4. Update Model: Add the weak learner to the model with a
certain learning rate.
5. Repeat steps 2-4 for a fixed number of iterations or until the
model converges.
• Limitations: Slower to train and more prone to overfitting without
careful tuning.
Hyperparameters: learning rate, number of boosting stages,
maximum depth of individual trees.

XGBoost
• Enhances gradient boosting by making it faster, more efficient, and
more accurate.
→ Execution speed: Parallelization (It will use all cores of CPU),
Cache optimization, Out of memory (Data size bigger than memory)
→ Model performance:
- Adds regularization to balance the trade-off between fitting the
training data and maintaining model simplicity.
- Auto pruning: Prevents trees from growing too large, improving
generalization and reducing the risk of overfitting.
- During training, model learns the optimal way to split data with
missing values as well as model learns from the patterns of missing
data and adjusts the decision boundaries accordingly.
- Efficient handling of sparse data
- Flexible: Supports a variety of loss functions and custom objective
functions
Hyperparameters: Learning Rate, Number of Trees, Maximum Depth,
Min Child Weight, Subsample, Booster Type, Early Stopping Rounds

Unsupervised Machine Learning
1. Clustering
2. Dimension Reduction
3. Association Rule Mining
4. Graphical Modelling and Network Analysis

Clustering
Grouping objects into meaningful subsets or, clusters. → Objects
within each cluster are similar.
Clustering Algorithms:
1. Partition-based methods
(a) K-means clustering
(b) Fuzzy C-Means
2. Hierarchical methods
(a) Agglomerative Clustering
(b) Divisive Clustering
3. Density-based methods
(a) Density-Based methods (DBSCAN)

K-means clustering
The objective of K-means clustering is to minimize total intra-cluster
variance or, the squared error function.
Objective function → $J = \sum_{j=1}^{K} \sum_{i=1}^{n} \|X_i^{(j)} - C_j\|^2$
Here, K is No. of clusters, n is No. of cases, Cj is centroid for
cluster j
1. Divide data into K clusters or groups.
2. Randomly select centroid for each of these K clusters.
3. Assign data points to their closest cluster centroid according
to Euclidean/ Square Euclidean/Manhattan/Cosine
4. Calculate the centroids of the newly formed clusters.
5. Repeat steps 3 and 4 until the same centroids (convergences)
are assigned to each cluster.
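The K-means procedure (random centroid selection, assignment, centroid update, repeat until assignments stop changing) can be sketched in plain Python as follows. This is an illustrative sketch under assumptions: squared Euclidean distance is used, the function names, the fixed random seed and the toy points are hypothetical, and the returned `inertia` corresponds to the objective J.

```python
import random

def squared_euclidean(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Step 2: randomly pick k distinct data points as initial centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Step 3: assign each point to its closest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_euclidean(p, centroids[j]))
            for p in points
        ]
        # Step 5: stop once the assignments no longer change.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 4: recompute each centroid as the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    # Objective J: total squared distance of points to their centroid.
    inertia = sum(
        squared_euclidean(p, centroids[lab]) for p, lab in zip(points, labels)
    )
    return centroids, labels, inertia

# Toy data (hypothetical): two well-separated groups of three points.
points = [[1.0, 1.0], [1.5, 2.0], [1.0, 0.5],
          [8.0, 8.0], [9.0, 9.5], [8.5, 9.0]]
centroids, labels, inertia = kmeans(points, k=2)
```

On data with less clear separation the result depends on the initial centroids, so in practice the algorithm is usually run several times with different seeds and the run with the lowest J is kept.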
→ K-means always converges (mostly to a local minimum, not to the
global minimum)
• How to choose K number of clusters in K-Means algorithm?
→ The maximum possible number of clusters will be equal to the
number of observations in the dataset.

Hierarchical Clustering
Agglomerative method: "Bottom-up"
Procedure:
1. Compute the distance or, proximity matrix
2. Initialization: Each observation is a cluster
3. Iteration: Merge two clusters which are most similar; until all
observations are merged into a single cluster.
Divisive method: "Top-down"
1. Compute the distance, or proximity matrix
2. Initialization: All objects stay in one cluster
3. Iteration: Select a cluster and split it into two sub-clusters
until each leaf cluster contains only one observation.
Proximity (distance) matrix
→ Single linkage: Shortest distance between two points, one from each
cluster (Ward linkage instead minimizes the within-cluster variance)
$L(C_1, C_2) = \min\left[ D\left(X_i^{C_1}, X_j^{C_2}\right) \right]$
→ Complete linkage: Longest distance between two points in each
cluster. Minimize maximum distance between cluster pairs
$L(C_1, C_2) = \max\left[ D\left(X_i^{C_1}, X_j^{C_2}\right) \right]$
→ Average linkage: Minimize average distance between cluster pairs
$L(C_1, C_2) = \frac{1}{n_{C_1} n_{C_2}} \sum_{i=1}^{n_{C_1}} \sum_{j=1}^{n_{C_2}} D\left(X_i^{C_1}, X_j^{C_2}\right)$

DBSCAN
→ Two parameters: ε - distance, minimum points
→ Three classifications of points:
• Core: has at least minimum points within ε - distance including
itself
• Border: has less than minimum points within ε - distance
but can be reached by clusters.
• Outlier: point that cannot be reached by cluster
Procedure:
1. Pick a random point that has not been assigned to a cluster
or, designated as an Outlier. Determine if it is a Core Point.
If not, label the point as Outlier.
2. Once a Core Point has been found, add all directly reachable
points to its cluster. Then do neighbor jumps to each reachable
point and add them to the cluster. If an Outlier has been
added, label it as a Border Point.
3. Repeat these steps until all points are assigned a cluster or,
labeled as Outlier.

Dimensionality Reduction Methods
Reduce the number of input variables (attributes or features) in
dataset.

Principal Component Analysis (PCA)
PCA combines highly correlated variables into a new, smaller set of
constructs called principal components, which capture most of the
variance present in the data.
• Dimensionality reduction
• Feature extraction
• Data visualization
1. Standardize the data: $Z = \frac{X - \mathrm{mean}}{SD}$
2. Calculate covariance-matrix of the standardized data
$V = \mathrm{cov}(Z^T)$
3. Find eigen-values and eigen-vectors from the
covariance-matrix
values, vectors = eig(V)
4. Feature vectors: this is simply the matrix whose columns are the
eigen-vectors of the components that we decide to keep.
5. Project data → $Z_{new} = \mathrm{vectors}^T \cdot Z^T$

Association Rule Mining
"Market Basket Analysis" → It uses Machine Learning models to
analyze data for patterns or, co-occurrence in a database.

Graphical Modelling and Network Analysis
"Bayesian Networks"

Neural Network
A neural network is a type of machine learning model that mimics the
structure and function of the human brain to recognize patterns,
make decisions, and learn from data.
→ Input Layer: The first layer that receives the input data. Each
neuron in this layer corresponds to a feature of the input data.
→ Hidden Layers: Layers between the input and output layers where
the network learns complex patterns.
→ Output Layer: The final layer that produces the network's output,
such as a prediction or classification.
• Perceptron - the foundation of a neural network, and it is a
single-layer neural network. An Artificial Neuron is a basic building
block of a neural network.
• Neural Network - a multi-layer perceptron
→ Weights: are the real values that are attached with each
input/feature and they convey the importance of that feature in
predicting the final output.
→ Bias: is used for shifting the activation function towards left or
right.
→ Summation Function: used to bind the weights and inputs
together and calculate their sum.
→ Activation Function: decides whether a neuron should be activated
or not, and it introduces non-linearities into the network which makes
it capable of learning and performing more complex tasks.
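As a minimal sketch of the artificial neuron described in the Neural Network section (weights, bias, summation function, activation function), the snippet below computes one neuron's output with sigmoid, ReLU and tanh activations, plus a softmax over a list of scores. The function names and the example inputs and weights are hypothetical, chosen only for illustration.

```python
import math

def sigmoid(z):
    """1 / (1 + e^-z), squashes z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    """max(0, z)."""
    return max(0.0, z)

def softmax(scores):
    """Turn a list of scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def neuron(inputs, weights, bias, activation):
    """Summation function (weighted sum plus bias) followed by an activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

inputs = [1.0, 2.0]
weights = [0.5, -0.25]  # weighted sum: 0.5 * 1.0 - 0.25 * 2.0 = 0.0

out_sigmoid = neuron(inputs, weights, 0.0, sigmoid)   # sigmoid(0) = 0.5
out_relu = neuron(inputs, weights, -1.0, relu)        # relu(-1) = 0.0
out_tanh = neuron(inputs, weights, 0.0, math.tanh)    # tanh(0) = 0.0
class_probabilities = softmax([1.0, 2.0, 3.0])
```

Changing the bias shifts the point at which the activation switches on, which is the "shifting left or right" role described for the bias term.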
Activation functions:
Sigmoid: $\frac{1}{1 + e^{-z}}$
ReLU: $\max(0, z)$
Tanh: $\frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$
→ Softmax - used as the last activation function of a neural network
to normalize the output of a network to a probability distribution over
predicted output classes. These probabilities sum to 1 → $\frac{e^{z_i}}{\sum_j e^{z_j}}$
→ If there is more than one 'correct' label, the sigmoid function
provides probabilities for all, some, or none of the labels.
How Neural Networks Work?
• Forward Propagation: The input data is passed through the
network, layer by layer, with each neuron applying its weights and bias
to the input and passing the result through the activation function.
The final layer produces the output.
• Backpropagation: Backpropagation is an algorithm used in neural
networks to adjust the internal weights and biases to minimize the
error calculated by the loss function.
– Regression Loss: Mean Squared Error/Squared loss/ L2 loss,
Mean Absolute Error/ L1 loss, Huber Loss
– Classification Loss: Binary Cross Entropy/log loss, Categorical
Cross Entropy
The common loss functions are summed up in the table below:
Loss                  Formula                        Used in
Least squared error   $\frac{1}{2}(y - \hat{y})^2$          Linear Regression
Logistic loss         $\log(1 + \exp(-y\hat{y}))$           Logistic Regression
Hinge loss            $\max(0, 1 - y\hat{y})$               SVM
→ During training, the network uses a supervised learning method
where the difference (error) between the network's predicted output
and the known expected output is calculated. This error is then
propagated back through the network via backpropagation to
compute the gradient of the loss function with respect to each
weight. An optimization algorithm, such as gradient descent, uses
these gradients to update the weights and biases, reducing the error
and improving the model's accuracy over time.
• Training: The process of forward propagation, loss calculation, and
backpropagation is repeated over many iterations, allowing the
network to learn from the data and improve its accuracy.
• To prevent overfitting, regularization can be applied by:
– Stopping training when validation performance drops
– Dropout - randomly drop some nodes during training to prevent
over-reliance on a single node
– Embedding weight penalties into the objective function
– Batch Normalization - stabilizes learning by normalizing inputs
to a layer

Convolutional Neural Network
CNN is a neural network architecture that is well-suited for image
classification and object recognition tasks. The general CNN
architectures are as shown below:
→ A convolutional neural network starts by taking an input image,
represented as a matrix of pixel values
→ This input image is passed through convolutional layers. Here, a
set of filters applies to the input image to detect features like edges,
textures, and patterns. Each filter produces a feature map that
highlights a specific aspect of the input image.
→ After each convolution, an activation function (like ReLU) is
applied to introduce non-linearity, enabling the network to learn more
complex patterns.
→ This produces feature maps. Different weights lead to different
feature maps.
→ The feature maps are then passed through pooling layers, which
downsample the spatial dimensions by taking the maximum or
average value in small regions. This reduces the size of the feature
maps and retains essential information, making the network more
efficient and less sensitive to slight changes in the input.
→ Again, the feature maps produced by the convolutional layer and
pooling layer are then passed through multiple additional
convolutional and pooling layers, each layer learning increasingly
complex features of the input image.
→ Now, the output obtained from above is fed into a fully connected
layer for classification, object detection, or other structural analyses.
The final output of the network is a predicted class label or
probability score for each class, depending on the task.
Question: Describe the difference between batch normalization and
layer normalization.

Recurrent Neural Network
Recurrent Neural Networks (RNNs) are designed to process sequences
of data such as time series data, voice, natural language, and other
activities.
→ RNNs memorize information from previous data with feedback
loops inside them, which helps to keep data information over time.
→ It has an arrow pointing to itself, indicating that the data inside
block "A" will be recursively used. Once expanded, its structure is
equivalent to a chain-like structure.
→ Learning to store information or data over long periods of time
intervals via recurrent backpropagation takes a very long time. Hence,
the gradients gradually vanish as they propagate to earlier time
steps. These downstream gradients rely on parameter (weight)
sharing for efficiency, and repeatedly multiplying values greater than
or less than 1 leads to:
– Exploding gradients - model instability and overflows
– Vanishing gradients - loss of learning ability
→ This can be solved using:
– Gradient clipping - cap the maximum value of gradients
– ReLU - its derivative prevents gradient shrinkage for x > 0
– Gated cells - regulate the flow of information
Also, because the problem is non-convex, RNN training can get stuck
in a local minimum instead of the global minimum. To overcome
these problems, LSTM was introduced as a gated RNN architecture
for sequence (e.g. language) modelling.
• Vanishing gradient problem for RNNs. The sensitivity decays as
the network backpropagates through time. The darker the shade,
the greater the sensitivity.
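The vanishing and exploding gradient behaviour described for RNNs can be illustrated with a scalar stand-in: backpropagating through many time steps multiplies the gradient by roughly the same factor again and again. The sketch below is only an illustration under that simplifying assumption; the function names are hypothetical, and clipping by the gradient norm is one common variant of the gradient clipping mentioned in the text.

```python
def repeated_gradient_factor(factor, steps):
    """Multiply a unit gradient by the same factor once per time step."""
    grad = 1.0
    for _ in range(steps):
        grad *= factor
    return grad

def clip_by_norm(grads, max_norm):
    """Rescale the gradient vector so its L2 norm is at most max_norm."""
    norm = sum(g * g for g in grads) ** 0.5
    if norm <= max_norm or norm == 0.0:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]

# A factor below 1 shrinks the gradient towards 0 (vanishing),
# a factor above 1 blows it up (exploding).
vanishing = repeated_gradient_factor(0.9, 100)
exploding = repeated_gradient_factor(1.1, 100)

# Clipping keeps the direction of the gradient but caps its length.
clipped = clip_by_norm([3.0, 4.0], max_norm=1.0)
```

Clipping only addresses the exploding case; the vanishing case is what gated cells such as LSTM are meant to mitigate.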
• Preservation of gradient information by LSTM. The sensitivity of P (Xt = 1) = 0.5, P (Xt = −1) = 0.5 – Forward Chain - train F1 , test F2 , then train F1 , F2 , test F3
the output layer can be switched on and off. • Exponential Smoothing - uses an exponentially decreasing weight
→ LSTM memorize the information for the long period of time. The → Gaussian Noise example:A continues process: Gaussian noise
to observations over time, and takes a moving average. The time t
difference between RNN and LSTM are: RNN cell has only one tanh {Xt } is a sequence of iid normal random variables with zero mean
output is st = αxt + (1 − α)st−1 , where 0 < α < 1.
layer while LSTM cell has four layers: forget gate layer, store gate and σ 2 variance; i.e., Xt ∼ N (0, σ 2 )
• Double Exponential Smoothing - applies a recursive exponential
layer, new cell state layer, output layer, and previous cell state as → Random walk: The random walk {St , t = 0, 1, 2, . . .} (starting at
filter to capture trends within a time series
shown in Figure below. zero, S0 = 0) is obtained by cumulatively summing (or ”integrating”)
st = αxt + (1 − α)(st−1 + bt−1 )
random variables; i.e., S0 = 0 and
bt = β(st − st−1 ) + (1 − β)bt−1
Time Series St = X1 + · · · + Xt , for t = 1, 2, . . .
where {Xt } is iid noise with zero mean and σ 2 variance. Note that by Triple exponential smoothing adds a third variable γ that accounts for
It is a random sequence {Xt } of real values recorded at successive seasonality.
differencing, we can recover Xt ; i.e.,
equally spaced points in time.
∇St = St − St−1 = Xt • ARIMA - models time series using three parameters (p, d, q):
→ Not every data collected with respect to time represents a time
series. Further, we have – Autoregressive - the past p values affect the next value
Methods of prediction & forecasting, time based data is Time X
!
X X – Integrated - values are replaced with the difference between
Series Modeling E(St ) = E Xt = E(Xt ) = 0=0 current and previous values, using the difference degree d (0 for
• Examples of time series: Stock Market Price, Passenger Count of t t i stationary data, and 1 for non-stationary)
– Moving Average - the number of lagged forecast errors and the
!
airlines, Temperature over time, Monthly Sales Data, X X
Quarterly/Annual Revenue, Hourly Weather Data/Wind Speed, IOT Var(St ) = Var Xt = Var(Xt ) = tσ 2 size of the moving average window q
sensors in Industries and Smart Devices, Energy Forecasting t t
• SARIMA - models seasonality through four additional
→ White Noise: We say {Xt } is a white noise; i.e., seasonality-specific parameters: P , D, Q, and the season length s
Difference between Time Series and Regression
Xt ∼ WN(0, σ 2 ), if {Xt } is uncorrelated, i.e., Cov (Xt1 , Xt2 ) = 0 for
• Time Series is time dependent. However the basic assumption of a • Prophet - additive model that uses non-linear trends to account for
any t1 and t2 with E[Xt ] = 0 and Var(Xt = σ 2 ).
linear regression model is that the observations are independent. multiple seasonalities such as yearly, weekly, and daily.
• Along with an increasing or decreasing trend, most Time Series Note: Every IID(0, σ 2 ) sequence is WN(0, σ 2 ) but not conversely. → Robust to missing data and handles outliers well.
have some form of seasonality trends • Moving Average Smoother This is an essentially non-parametric → Can be represented as: y(t) = g(t) + s(t) + h(t) + ϵ(t), with four
Note: method for trend estimation. It takes averages of observations around distinct components for the growth over time, seasonality, holiday
→ Predicting a time series using regression techniques is not a good t; i.e., it smooths the series. For example, let effects, and error. This specification is similar to a generalized
approach. 1 additive model.
→ Time series forecasting is the use of a model to predict future Xt = (Wt−1 + Wt + Wt+1 )
3 • Generalized Additive Model - combine predictive methods while
values based on previously observed values. preserving additivity across variables, in a form such as
which is a three-point moving average of the white noise series Wt .
→ A stochastic process is defined as a collection of random variables y = β0 + f1 (x1 ) + · · · + fm (xm ), where functions can be non-linear.
→ AR(1) model (Autoregression of order 1): Let
X = {Xt : t ∈ T } defined on a common probability space, taking → GAMs also provide regularized and interpretable solutions for
Xt = 0.6Xt−1 + Wt regression and classification problems.
values in a common set S (the state space), and indexed by a set T ,
often either N or [0, ∞) and thought of as time (discrete or where Wt is a white noise series. It represents a regression or Tutorial: Complete Guide on Time Series Analysis in Python
continuous respectively) (Oliver, 2009). prediction of the current value Xt of a time series as a function of the
past two values of the series.
Time Series Statistical Models
A time series model specifies the joint distribution of the sequence $\{X_t\}$ of random variables; e.g.,
$P(X_1 \le x_1, \ldots, X_t \le x_t)$ for all $t$ and $x_1, \ldots, x_t$

Typically, a time series model can be described as
$X_t = m_t + s_t + Y_t$
where $m_t$: trend component; $s_t$: seasonal component; $Y_t$: zero-mean error
Note: The following are some zero-mean models
→ iid noise: The simplest time series model is the one with no trend or seasonal component, in which the observations $X_t$ are simply independent and identically distributed random variables with zero mean. Such a sequence of random variables $\{X_t\}$ is referred to as iid noise.
$P(X_1 \le x_1, \ldots, X_t \le x_t) = \prod_{i=1}^{t} P(X_i \le x_i) = \prod_{i=1}^{t} F(x_i)$
where $F(\cdot)$ is the cdf of each $X_t$. Further, $E(X_t) = 0$ for all $t$. We denote such a sequence as $X_t \sim \mathrm{IID}(0, \sigma^2)$. IID noise is not interesting for forecasting, since the conditional distribution of $X_t$ given $X_1, \ldots, X_{t-1}$ is the same as the distribution of $X_t$: the past carries no information about the next value.
→ iid noise example: A binary (discrete) process $\{X_t\}$ is a sequence of iid random variables $X_t$ with

Stationary Process
Extracts characteristics from time-sequenced data, which may exhibit the following characteristics:
– Stationarity - statistical properties such as mean, variance, and autocorrelation are constant over time: the autocovariance does not depend on time, and there is no trend or seasonality
– Non-Stationary - There are 2 major reasons behind the non-stationarity of a time series:
  • Trend - varying mean over time (mean is not constant)
  • Seasonality - variations at specific time-frames (standard deviation is not constant)
– Trend - the general direction in which something is developing or changing.
– Seasonality - any predictable change or pattern in a time series that recurs over a specific time period (calendar times), occurring at regular intervals of less than a year
– Cyclicality - variations without a fixed time length, occurring in periods of greater or less than one year
– Autocorrelation - degree of linear similarity between current and lagged values
• CV must account for the time aspect, such as for each fold $F_x$:
– Sliding Window - train $F_1$, test $F_2$, then train $F_2$, test $F_3$
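The sliding-window scheme above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `sliding_window_splits` and the choice of equal-sized contiguous folds (with any leftover samples added to the last fold) are assumptions made for the example.

```python
def sliding_window_splits(n_samples, n_folds):
    """Yield (train_indices, test_indices): train on fold k, test on fold k+1.

    Hypothetical helper for illustration. Folds are contiguous in time,
    so every test fold lies strictly after its training fold.
    """
    if n_folds < 2 or n_samples < n_folds:
        raise ValueError("need at least 2 folds and one sample per fold")
    fold_size = n_samples // n_folds
    folds = [
        list(range(k * fold_size, (k + 1) * fold_size))
        for k in range(n_folds)
    ]
    # Assumption: leftover samples are appended to the last fold.
    folds[-1].extend(range(n_folds * fold_size, n_samples))
    for k in range(n_folds - 1):
        yield folds[k], folds[k + 1]


# Nine time-ordered observations split into three folds F1, F2, F3.
splits = list(sliding_window_splits(n_samples=9, n_folds=3))
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```

Unlike ordinary k-fold CV, no split ever trains on observations that come after the test fold, which avoids leaking future information into the model.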
Natural Language Processing
NLP is the discipline of building machines that can manipulate human language (or data that resembles human language) in the way that it is written, spoken, and organized. It evolved from computational linguistics.
NLP Applications
Evolution of NLP
Challenges in NLP
• The 3 stages of an NLP pipeline are: Text Processing → Feature Extraction → Modeling.

Text Processing
Take raw input text, clean it, normalize it, and convert it into a form that is suitable for feature extraction.
Libraries: nltk, spaCy
– Lower casing
– Removing other content such as punctuation, tags, and URLs, depending on the problem
– Converting chat words used in social media to normal words
– Spelling correction using libraries like TextBlob
– Stop words - removes common and irrelevant words (the, is)
Note: Do not remove stop words when using POS tagging in text processing.
– Tokenization - splits text into individual words (tokens) and word fragments.
  • Sentence-level tokenization involves splitting a text into individual sentences.
  • Word-level tokenization involves splitting each sentence into individual words or tokens.
– Lemmatization - reduces a word to its base form based on its dictionary definition (am, are, is → be)
– Stemming - reduces a word to its base form without context (ended → end)
– Language Detection

Advanced Text Processing
– POS Tagging
– Parse Tree
– Coreference Resolution

Feature Extraction
→ Feature Extraction = Text Representation = Text Vectorization
Common Terms:
• Corpus • Vocabulary • Document • Word
→ Most conventional machine learning techniques work on features (generally numbers that describe a document in relation to the corpus that contains it) created by Bag-of-Words, TF-IDF, or generic (custom) feature engineering such as document length, word polarity, and metadata (for instance, if the text has associated tags or scores).
Note: Deep learning does not require manual feature engineering.
– Bag-of-words - counts the number of times each word or n-gram (combination of n words) appears in a document.
– n-gram - predicts the next term in a sequence of n terms based on Markov chains
→ Markov Chain - stochastic and memoryless process that predicts future events based only on the current state
– tf-idf - In contrast to bag-of-words, TF-IDF weights each word by its importance. To evaluate a word's significance, we consider two things:
1. Term Frequency: How important is the word in the document?
$\mathrm{TF}(\text{word in a document}) = \frac{\text{number of occurrences of that word in the document}}{\text{number of words in the document}}$
2. Inverse Document Frequency: How important is the word in the whole corpus (a collection of documents)?
$\mathrm{IDF}(\text{word in a corpus}) = \log\left(\frac{\text{number of documents in the corpus}}{\text{number of documents that include the word}}\right)$
Note: A word is important if it occurs many times in a document. But that creates a problem: words like "a" and "the" appear often, so their TF score will always be high. We resolve this issue with Inverse Document Frequency, which is high if the word is rare and low if the word is common across the corpus.
The TF-IDF score of a term is the product of TF and IDF.
– Cosine Similarity - measures similarity between vectors, calculated as $\cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|}$, which ranges from −1 to 1 in general and from 0 to 1 for non-negative vectors such as TF-IDF vectors
In scikit-learn:
→ CountVectorizer - Bag of Words
→ TfidfTransformer - TF-IDF values
→ TfidfVectorizer - Bag of Words AND TF-IDF values

Word Embedding
Word embeddings are often based on neural network models in deep learning.
→ Based on CBOW, Skip-gram: Word2vec, GloVe, fastText
– Continuous bag-of-words (CBOW) - predicts the word given its context
– skip-gram - predicts the context given a word
Note: According to research, CBOW is used when only a small dataset is available.
• word2vec - trains iteratively over a corpus of text to learn the associations between words, preserving the semantic information as well as the contextual meanings of words within that corpus.
→ The embeddings are numerical representations of words and phrases, allowing similar words to have similar vector representations.
→ It uses the cosine similarity metric to measure semantic similarity. If the cosine similarity is one, the two word vectors point in the same direction. The vectors also capture analogies, such that king − man + woman ≈ queen
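The cosine-similarity and analogy behaviour described for word2vec can be illustrated with hand-made vectors. This is only a toy sketch: the two-dimensional vectors below are invented for the example (loosely "royalty" and "gender" axes) and are not real word2vec output, which would come from a trained model with hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """cos(theta) = (A . B) / (||A|| * ||B||) for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical 2-D embeddings: [royalty, gender]. Not learned from data.
vectors = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.2],
}

# king - man + woman, computed component-wise.
target = [
    k - m + w
    for k, m, w in zip(vectors["king"], vectors["man"], vectors["woman"])
]

# Rank the remaining words by cosine similarity to the target vector.
candidates = [w for w in vectors if w not in {"king", "man", "woman"}]
best = max(candidates, key=lambda w: cosine_similarity(target, vectors[w]))
print(best, cosine_similarity(target, vectors[best]))
```

With these toy vectors the nearest candidate is "queen"; with real embeddings the match is approximate rather than exact.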
• GloVe (Global Vectors for Word Representation) - GloVe operates on the idea that words that frequently co-occur, sharing similar contexts, tend to have related meanings. It builds a global co-occurrence matrix that captures the frequency of word co-occurrences within a context window across the entire corpus.
→ Based on transformer architecture
• BERT - accounts for word order and trains on subwords, and unlike word2vec and GloVe, BERT outputs different vectors for different uses of words (cell phone vs. blood cell)

Sentiment Analysis
Extracts the attitudes and emotions from text
• Polarity - measures positive, negative, or neutral opinions
– Valence shifters - capture amplifiers or negators such as 'really fun' or 'hardly fun'
• Sentiment - measures emotional states such as happy or sad
• Subject-Object Identification - classifies sentences as either subjective or objective

Topic Modelling
Captures the underlying themes that appear in documents
• Latent Dirichlet Allocation (LDA) - generates k topics by first assigning each word to a random topic, then iteratively updating assignments based on parameters α, the mix of topics per document, and β, the distribution of words per topic
• Latent Semantic Analysis (LSA) - identifies patterns using tf-idf scores and reduces data to k dimensions through SVD

NLP Tutorial
Duplicate Question Pairs - Quora Question Pairs: NLP Pipeline
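The three pipeline stages (Text Processing → Feature Extraction → Modeling) can be sketched end to end for a duplicate-question task using only the Python standard library. Everything named here is an assumption for illustration: the stop-word list, the sample questions, and the helper names are invented, and the "modeling" stage is a plain cosine-similarity score standing in for a trained classifier. TF and IDF follow the definitions given earlier in this cheatsheet.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list; real projects would use nltk or spaCy lists.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "in",
              "how", "do", "i", "what"}


def preprocess(text):
    """Text processing: lower-case, tokenize on alphanumerics, drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def tfidf_vectors(docs):
    """Feature extraction: TF = count / doc length, IDF = log(N / doc frequency)."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vocab = sorted(doc_freq)
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        vectors.append([
            (counts[t] / total) * math.log(n_docs / doc_freq[t]) if total else 0.0
            for t in vocab
        ])
    return vocab, vectors


def cosine_similarity(a, b):
    """Modeling stand-in: similarity score between two TF-IDF vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Invented sample questions; q1 and q2 are paraphrases, q3 is unrelated.
questions = [
    "How do I learn Python quickly?",
    "What is the quickest way to learn Python?",
    "How is the weather in Paris?",
]
docs = [preprocess(q) for q in questions]
vocab, vecs = tfidf_vectors(docs)
sim_q1_q2 = cosine_similarity(vecs[0], vecs[1])
sim_q1_q3 = cosine_similarity(vecs[0], vecs[2])
print(sim_q1_q2 > sim_q1_q3)
```

The paraphrased pair scores higher than the unrelated pair because they share the terms "learn" and "python". A real duplicate-question model would feed such features, or learned embeddings, into a trained classifier.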

Last Updated September 19, 2024
