ML Interview Cheat Sheet
[Figure: bias-variance tradeoff] High variance (overly-complex model) leads to over-fitting: low error on train data and high error on test data, as the model starts modelling the noise in the input. The optimal model complexity lies at the point of minimum total error.
Cheat Sheet - Imbalanced Data in Classification
Confusion matrix (Blue: Label 1, Green: Label 0) and derived metrics:
Precision = TP / (TP + FP)
Recall, Sensitivity (True +ve rate) = TP / (TP + FN)
Specificity = TN / (TN + FP)
F1 score = 2 x (Prec x Rec) / (Prec + Rec)
Accuracy = (TP + TN) / (TP + FN + FP + TN)
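A minimal Python sketch of these formulas (the function and example counts below are my own, added for illustration):

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the cheat-sheet metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)            # sensitivity, true positive rate
    specificity = tn / (tn + fp)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    return {"precision": precision, "recall": recall,
            "specificity": specificity, "f1": f1, "accuracy": accuracy}

# Example: a classifier evaluated on an imbalanced test set
print(classification_metrics(tp=20, fp=5, tn=900, fn=75))
```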
Possible solutions
1. Data Replication: Replicate the available data until the number of samples is comparable.
2. Synthetic Data: For images, rotate, dilate, crop, or add noise to existing input images to create new data.
3. Modified Loss: Modify the loss to reflect greater error when misclassifying the smaller sample set (see the sketch after this list).
4. Change the algorithm: Increase the model/algorithm complexity so that the two classes are perfectly separable (Con: Overfitting).
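As a concrete illustration of a modified loss (a minimal sketch under my own assumptions, not from the cheat sheet), a class-weighted binary cross-entropy penalizes misclassifying the minority class more heavily:

```python
import numpy as np

def weighted_bce(y_true, y_pred, w_pos=10.0, w_neg=1.0, eps=1e-7):
    """Binary cross-entropy where errors on the minority (positive) class cost w_pos times more."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    loss = -(w_pos * y_true * np.log(y_pred) +
             w_neg * (1 - y_true) * np.log(1 - y_pred))
    return loss.mean()

# The single rare positive sample is badly misclassified and dominates the loss
y_true = np.array([1, 0, 0, 0])
y_pred = np.array([0.1, 0.1, 0.2, 0.1])
print(weighted_bce(y_true, y_pred))
```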
Increase model complexity (example): No straight line (y = ax) passing through the origin can perfectly separate the data; the best such solution is the line y = 0, which predicts all labels as blue. A straight line (y = ax + b) can perfectly separate the data, so the green class will no longer be predicted as blue.
Cheat Sheet - PCA (Dimensionality Reduction)
What is PCA?
Given the dataset, PCA finds a new set of orthogonal feature vectors such that the data spread is maximum in the direction of each feature vector (or dimension).
It ranks the feature vectors in decreasing order of data spread (or variance).
The datapoints have maximum variance along the first feature vector and minimum variance along the last feature vector.
The variance of the datapoints in the direction of a feature vector can be regarded as a measure of the information in that direction.
Steps (a minimal Python sketch follows the list):
1. Standardize the datapoints
2. Compute the covariance matrix of the standardized datapoints
3. Carry out the eigenvalue decomposition of the covariance matrix
4. Sort the eigenvalues and eigenvectors in decreasing order of eigenvalue
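A minimal NumPy sketch of these steps (function and variable names are my own; the projection onto the top-k components is added for illustration):

```python
import numpy as np

def pca(X, k=2):
    """PCA via eigen-decomposition of the covariance matrix. X has shape (n_samples, n_features)."""
    # 1. Standardize the datapoints
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    # 2. Covariance matrix of the standardized data
    C = np.cov(Xs, rowvar=False)
    # 3. Eigenvalue decomposition (eigh since C is symmetric)
    eigvals, eigvecs = np.linalg.eigh(C)
    # 4. Sort eigenvectors by decreasing eigenvalue
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Project onto the top-k principal components
    return Xs @ eigvecs[:, :k], eigvals

X = np.random.randn(100, 5)
Z, variances = pca(X, k=2)
print(Z.shape, variances)
```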
[Figures 1 and 2: the same datapoints plotted on the original axes (Feature #1, Feature #2) and on the new orthogonal axes (New Feature #1, New Feature #2); the first new feature captures the maximum variance and the last new feature the minimum standard deviation of the datapoints.]
Cheat Sheet - Bayes Theorem and Classifier
What is Bayes Theorem?
Describes the probability of an event, based on prior knowledge of conditions that might be
related to the event.
P(A|B) describes how the probability of an event A changes when we have knowledge of another event B; it is usually a better estimate than P(A) alone.

Bayes Theorem:
P(A|B) = P(B|A) x P(A) / P(B)
where P(A|B) is the posterior probability, P(B|A) the likelihood, P(A) the prior probability, and P(B) the evidence.

Example
Probability of fire P(F) = 1%
Probability of smoke P(S) = 10%
Probability of smoke given there is a fire P(S|F) = 90%
What is the probability that there is a fire given we see smoke, P(F|S)?
P(F|S) = P(S|F) x P(F) / P(S) = 0.9 x 0.01 / 0.1 = 9%
Maximum Likelihood Estimation (MLE)
The MLE estimate of the random variable y, given i.i.d. observations (x1, x2, x3, ...), is the value ŷ that maximizes only the likelihood; we assume we don't have any prior knowledge of the quantity being estimated:
ŷ_MLE = argmax_y P(x1, x2, x3, ... | y)
MLE is a special case of MAP where our prior is uniform (all values are equally likely)
The Naive Bayes classifier assumes the features (x1, x2, x3, ...) are independent given the class, i.e. P(x1, x2, ..., xn | y) = P(x1 | y) P(x2 | y) ... P(xn | y).
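A tiny numeric sketch (my own illustrative example, not from the cheat sheet) of MLE vs. MAP for a Bernoulli parameter, showing that MLE coincides with MAP under a uniform Beta(1, 1) prior:

```python
# Estimate the probability p of heads from i.i.d. coin flips.
heads, tails = 7, 3

# MLE: maximize the likelihood only (no prior knowledge)
p_mle = heads / (heads + tails)

# MAP with a Beta(a, b) prior: mode of the posterior Beta(heads + a, tails + b)
def p_map(a, b):
    return (heads + a - 1) / (heads + tails + a + b - 2)

print(p_mle)        # 0.7
print(p_map(1, 1))  # uniform prior -> identical to MLE: 0.7
print(p_map(5, 5))  # informative prior pulls the estimate toward 0.5
```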
Cheat Sheet - Regression Analysis
What is Regression Analysis?
Fitting a function f(.) to datapoints yi=f(xi) under some error function. Based on the estimated
function and error, we have the following types of regression
1. Linear Regression:
Fits a line minimizing the sum of mean-squared error
for each datapoint.
2. Polynomial Regression:
Fits a polynomial of order k (k+1 unknowns) minimizing
the sum of mean-squared error for each datapoint.
3. Bayesian Regression:
For each datapoint, fits a Gaussian distribution by minimizing the mean-squared error. As the number of datapoints xi increases, it converges to point estimates.
4. Ridge Regression:
Can fit either a line, or polynomial minimizing the sum
of mean-squared error for each datapoint and the
weighted L2 norm of the function parameters beta.
5. LASSO Regression:
Can fit either a line, or polynomial minimizing the sum of mean-squared error for each datapoint and the weighted L1 norm of the function parameters beta.
6. Logistic Regression (NOT regression, but classification):
Can fit either a line, or polynomial with sigmoid
activation minimizing the sum of mean-squared error for
each datapoint. The labels y are binary class labels.
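As a concrete illustration of the ridge objective above (a minimal NumPy sketch under my own assumptions; LASSO has no closed form and is typically solved iteratively, e.g. by coordinate descent):

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Closed-form ridge regression: beta = (X^T X + lam * I)^(-1) X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# Fit a noisy line y = 1 + 2x (the bias is handled via a column of ones)
x = np.linspace(0, 1, 50)
X = np.column_stack([np.ones_like(x), x])
y = 1 + 2 * x + 0.1 * np.random.randn(50)
print(ridge_fit(X, y, lam=0.1))  # roughly [1, 2]; larger lam shrinks the weights
```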
Visual Representation:
[Figure: example fits for Linear Regression, Polynomial Regression, Bayesian Linear Regression, and Logistic Regression (y vs. x; the logistic panel shows Label 0 and Label 1).]
Summary:
Regression      | What does it fit?                    | Error function
Linear          | A line in n dimensions               | Sum of squared errors
Polynomial      | A polynomial of order k              | Sum of squared errors
Bayesian Linear | Gaussian distribution for each point | Mean-squared error
Ridge           | Linear/polynomial                    | Squared errors + weighted L2 norm of parameters
LASSO           | Linear/polynomial                    | Squared errors + weighted L1 norm of parameters
Logistic        | Linear/polynomial with sigmoid       | Squared errors on binary class labels
Cheat Sheet - Regularization in ML
What is Regularization in ML?
Regularization is an approach to address over-fitting in ML. [Figure 1: Overfitting]
An overfitted model fails to generalize its estimations to test data.
When the underlying model to be learned is low bias/high variance, or when we have a small amount of data, the estimated model is prone to over-fitting.
Regularization reduces the variance of the model.
Types of Regularization:
1. Modify the loss function:
L2 Regularization: Prevents the weights from getting too large (defined by the L2 norm). The larger the weights, the more complex the model is, and the more chances of overfitting.
L1 Regularization: Prevents the weights from getting too large (defined by the L1 norm). The larger the weights, the more complex the model is, and the more chances of overfitting. L1 regularization introduces sparsity in the weights: it forces more weights to be exactly zero rather than reducing the average magnitude of all weights.
Entropy: Used for models that output a probability. Forces the probability distribution towards the uniform distribution.
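A minimal sketch (my own illustration, assuming a generic mean-squared-error loss over weights w) of how the L1 and L2 penalty terms are added to the loss:

```python
import numpy as np

def regularized_loss(y_true, y_pred, w, lam=0.01, kind="l2"):
    """MSE loss plus an L1 or L2 penalty on the weights w."""
    mse = np.mean((y_true - y_pred) ** 2)
    if kind == "l2":
        penalty = lam * np.sum(w ** 2)     # discourages large weights
    else:
        penalty = lam * np.sum(np.abs(w))  # additionally pushes weights to exactly zero
    return mse + penalty

w = np.array([0.5, -2.0, 0.0, 3.0])
y_true, y_pred = np.array([1.0, 0.0]), np.array([0.9, 0.2])
print(regularized_loss(y_true, y_pred, w, kind="l2"))
print(regularized_loss(y_true, y_pred, w, kind="l1"))
```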
Cheat Sheet - Famous CNNs
AlexNet 2012
Why: AlexNet was born out of the need to improve the results of
the ImageNet challenge.
What: The network consists of 5 Convolutional (CONV) layers and 3
Fully Connected (FC) layers. The activation used is the Rectified
Linear Unit (ReLU).
How: Data augmentation is carried out to reduce over-fitting. Uses Local Response Normalization.
VGGNet 2014
Why: VGGNet was born out of the need to reduce the # of
parameters in the CONV layers and improve on training time
What: There are multiple variants of VGGNet (VGG16, VGG19, etc.)
How: The important point to note here is that all the conv kernels are
of size 3x3 and maxpool kernels are of size 2x2 with a stride of two.
ResNet 2015
Why: Neural Networks are notorious for not being able to find a
simpler mapping when it exists. ResNet solves that.
What: There are multiple versions of ResNetXX architectures where
XX denotes the number of layers. The most used ones are ResNet50
and ResNet101. Since the vanishing gradient problem was taken care of
(more about it in the How part), CNN started to get deeper and deeper
How: ResNet architecture makes use of shortcut connections to solve
the vanishing gradient problem. The basic building block of ResNet is
a Residual block that is repeated throughout the network.
[Figures: the ResNet residual block (weight layers computing f(x), with a shortcut connection adding the input x to give f(x) + x) and an Inception-style module (1x1, 3x3, and 5x5 convolutions applied to the previous layer, followed by filter concatenation).]
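A minimal PyTorch sketch (my own illustration, not the exact ResNet implementation) of the residual-block idea, where the block's output is f(x) + x:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Two 3x3 conv layers whose output f(x) is added to the shortcut x."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        f = F.relu(self.bn1(self.conv1(x)))
        f = self.bn2(self.conv2(f))
        return F.relu(f + x)  # shortcut connection: f(x) + x

x = torch.randn(1, 64, 32, 32)
print(ResidualBlock(64)(x).shape)  # torch.Size([1, 64, 32, 32])
```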
Cheat Sheet - Convolutional Neural Network
Convolutional Neural Network:
The data gets into the CNN through the input layer and passes
through various hidden layers before getting to the output layer.
The output of the network is compared to the actual labels in
terms of loss or error. The partial derivatives of this loss w.r.t the
trainable weights are calculated, and the weights are updated
through one of the various methods using backpropagation.
CNN Template:
Most of the commonly used hidden layers (not all) follow a
pattern
1. Layer function: Basic transforming function such as
convolutional or fully connected layer.
a. Fully Connected: Linear functions between the input and the
output.
b. Convolutional Layers: These layers are applied to 2D (3D) input feature maps. The trainable weights are a 2D (3D) kernel/filter that moves across the input feature map, generating dot products with the overlapping region of the input feature map.
c. Transposed Convolutional (DeConvolutional) Layer: Usually used to increase the size of the output feature map (upsampling). The idea behind the transposed convolutional layer is to undo (not exactly) the convolutional layer.
[Figure: a fully connected layer computing y1 = w11*x1 + w21*x2 + w31*x3 + b1, and a convolutional kernel sliding over a 2D feature map, producing a dot product with each overlapping region.]
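A minimal NumPy sketch (my own illustration) of the dot-product-with-overlapping-region operation a convolutional layer performs, here a "valid" convolution with stride 1, ignoring bias and channels:

```python
import numpy as np

def conv2d_valid(feature_map, kernel):
    """Slide the kernel over the feature map; each output is a dot product with the overlap."""
    H, W = feature_map.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = feature_map[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)  # dot product with the overlapping region
    return out

fmap = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.array([[1.0, 0.0], [0.0, -1.0]])
print(conv2d_valid(fmap, kernel))  # 3x3 output feature map
```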
Cheat Sheet - Ensemble Learning in ML
What is Ensemble Learning? Wisdom of the crowd
Combine multiple weak models/learners into one predictive model to reduce bias, variance and/or improve accuracy.
1. Bagging: Trains N different weak models (usually of the same type, homogenous), each with an independent subset created from the original dataset, in parallel. In the test phase, each model predicts its label and the predictions are combined by voting to get the final prediction. Bagging methods decrease the variance of the prediction.
2. Boosting: Trains N different weak models (usually of the same type, homogenous) with the complete dataset in a sequential order. The datapoints wrongly classified by the previous weak model are given more weight so that they can be classified properly by the next weak learner. In the test phase, each model is evaluated and, based on the error of each weak model, its prediction is weighted for voting. Boosting methods decrease the bias of the prediction.
3. Stacking: Trains N different weak models (usually of different types, heterogenous) with one of two subsets of the dataset, in parallel. Once the weak learners are trained, they are used to train a meta-learner that combines their predictions, using the other subset for the final prediction. In the test phase, each model predicts its label, and this set of labels is fed to the meta-learner, which generates the final prediction.
The block diagrams for each of these three methods are summarized below.
Ensemble Method: Bagging
Step #1: Create N subsets (Subset #1 ... #4) from the original dataset, one for each weak model.
Step #2: Train each weak model (Weak Model #1 ... #4) with an independent subset, in parallel.
Step #3: In the test phase, predict with each weak model and vote their predictions to get the final prediction.

Ensemble Method: Boosting
Step #1: Assign equal (uniform) weights to all the datapoints in the dataset.
Step #2a: Train a weak model (Weak Model #1) with equal weights on all the datapoints.
Step #2b: Based on the error of the trained weak model, calculate a scalar alpha1. Use alpha to increase the weights of wrongly classified points and decrease the weights of correctly classified points.
Step #3a: Train the next weak model (Weak Model #2) with the adjusted weights on all the datapoints in the dataset.
Step #3b: Based on the error of the trained weak model, calculate a scalar alpha2 and adjust the weights again; repeat for the remaining weak models (alpha3, ...).
Step #(n+1)a: Train the last weak model with the adjusted weights on all the datapoints in the dataset.
Step #(n+2): In the test phase, predict with each weak model and vote their predictions, weighted by the corresponding alpha, to get the final prediction.

Ensemble Method: Stacking
Step #1: Split the input dataset into Subset #1 (weak learners) and Subset #2 (meta-learner).
Step #2: Train each weak model (Weak Model #1 ... #4) with the weak-learner subset.
Step #3: Train a meta-learner whose inputs are the outputs of the trained weak models on the meta-learner subset.
In the test phase, the outputs of the trained weak models are fed to the meta-learner, which generates the final prediction.
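A hedged sketch of the three methods using scikit-learn (assuming scikit-learn is available; the class and parameter names follow its current API and the tiny dataset is my own example):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

# Bagging: homogenous weak models trained in parallel on independent subsets (reduces variance)
bagging = BaggingClassifier(n_estimators=10, random_state=0)

# Boosting: homogenous weak models trained sequentially, reweighting misclassified points (reduces bias)
boosting = AdaBoostClassifier(n_estimators=10, random_state=0)

# Stacking: heterogenous weak models whose predictions feed a meta-learner
stacking = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier()), ("logreg", LogisticRegression())],
    final_estimator=LogisticRegression(),
)

for name, model in [("bagging", bagging), ("boosting", boosting), ("stacking", stacking)]:
    print(name, model.fit(X, y).score(X, y))
```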
Source: https://www.cheatsheets.aqeel-anwar.com