Machine Learning Cheat Sheet
What is Bias?
• Error between average model prediction and ground truth
• The bias of the estimated function tells us the capacity of the underlying model to
predict the values
What is Variance?
• Average variability in the model prediction for the given dataset
• The variance of the estimated function tells you how much the function can adjust
to the change in the dataset
High Bias (overly-simplified model)
• Under-fitting
• High error on both test and train data
Confusion matrix:
                    Predicted Positive     Predicted Negative
Actual Positive     True Positive (TP)     False Negative (FN)
Actual Negative     False Positive (FP)    True Negative (TN)

Metrics:
Precision = TP / (TP + FP)
Recall, Sensitivity (true +ve rate) = TP / (TP + FN)
Specificity = TN / (TN + FP)
Accuracy = (TP + TN) / (TP + FN + FP + TN)
F1 score = 2 × (Prec × Rec) / (Prec + Rec)
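A minimal sketch in Python computing these metrics from raw confusion-matrix counts (the count values below are made-up examples, not from the cheat sheet):

# made-up example counts
TP, TN, FP, FN = 40, 45, 5, 10

precision   = TP / (TP + FP)
recall      = TP / (TP + FN)            # sensitivity, true positive rate
specificity = TN / (TN + FP)
accuracy    = (TP + TN) / (TP + FN + FP + TN)
f1_score    = 2 * (precision * recall) / (precision + recall)

print(f"precision={precision:.2f} recall={recall:.2f} "
      f"specificity={specificity:.2f} accuracy={accuracy:.2f} f1={f1_score:.2f}")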
Possible solutions (class imbalance; in the figures, blue = label 1 is the larger class and green = label 0 the smaller):
1. Data Replication: Replicate the available data until the number of samples per class is comparable.
2. Synthetic Data: For images: rotate, dilate, crop, or add noise to existing input images to create new data.
3. Modified Loss: Modify the loss to reflect greater error when misclassifying the smaller sample set, e.g. loss = a · loss_green + b · loss_blue with a > b (a minimal sketch follows the figures below).
4. Change the algorithm: Increase the model/algorithm complexity so that the two classes are perfectly separable (con: overfitting).
[Figure 1 → Figure 2: increase model complexity]
Figure 1: No straight line (y = ax) passing through the origin can perfectly separate the data; the best solution, the line y = 0, predicts all labels blue.
Figure 2: A straight line (y = ax + b) can perfectly separate the data; the green class will no longer be predicted as blue.
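A minimal sketch of the modified-loss idea from point 3 above, assuming a binary cross-entropy base loss; the weight values a and b and the example labels are illustrative assumptions (green = label 0 is the minority class, as in the figures):

import numpy as np

def weighted_bce(y_true, y_pred, a=5.0, b=1.0, eps=1e-9):
    # per-sample binary cross-entropy
    bce = -(y_true * np.log(y_pred + eps) + (1 - y_true) * np.log(1 - y_pred + eps))
    # weight errors on the smaller class (green, label 0) by a > b
    weights = np.where(y_true == 0, a, b)
    return np.mean(weights * bce)

y_true = np.array([0, 1, 1, 1, 1])            # green (0) is the minority class
y_pred = np.array([0.4, 0.9, 0.8, 0.7, 0.6])  # predicted probabilities of label 1
print(weighted_bce(y_true, y_pred))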
[Figure: the original features (Feature #1, Feature #2) are transformed into new features ordered by variance, and the data is projected onto the new feature #1 and new feature #2 axes.]
Cheat Sheet – Bayes Theorem and Classifier
What is Bayes’ Theorem?
• Describes the probability of an event, based on prior knowledge of conditions that might be
related to the event.
• How the probability of an event changes when we have knowledge of another event: the posterior P(A|B) is usually a better estimate than P(A)

Bayes' Theorem

P(A|B) = P(B|A) · P(A) / P(B)

where P(A|B) is the posterior probability, P(B|A) the likelihood, P(A) the prior probability, and P(B) the evidence.

Example
• Probability of fire P(F) = 1%
• Probability of smoke P(S) = 10%
• Probability of smoke given there is a fire P(S|F) = 90%
• What is the probability that there is a fire given we see smoke, P(F|S)?
  P(F|S) = P(S|F) · P(F) / P(S) = (0.9 × 0.01) / 0.1 = 9%
The Naïve Bayes classifier assumes the features (x1, x2, …) are conditionally independent given the class, i.e. P(x1, x2, … | y) = P(x1 | y) · P(x2 | y) · …
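A minimal sketch of the fire/smoke example worked out in Python:

p_fire = 0.01              # prior P(F)
p_smoke = 0.10             # evidence P(S)
p_smoke_given_fire = 0.90  # likelihood P(S|F)

# posterior P(F|S) = P(S|F) * P(F) / P(S)
p_fire_given_smoke = p_smoke_given_fire * p_fire / p_smoke
print(p_fire_given_smoke)  # 0.09, i.e. 9%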
Cheat Sheet – Regression Analysis
What is Regression Analysis?
Fitting a function f(.) to datapoints yi=f(xi) under some error function. Based on the estimated
function and error, we have the following types of regression
1. Linear Regression:
Fits a line minimizing the sum of mean-squared error
for each datapoint.
2. Polynomial Regression:
Fits a polynomial of order k (k+1 unknowns) minimizing
the sum of mean-squared error for each datapoint.
3. Bayesian Regression:
For each datapoint, fits a Gaussian distribution by
minimizing the mean-squared error. As the number of
data points xi increases, it converges to point
estimates, i.e. the variance of the fitted distribution shrinks towards zero.
4. Ridge Regression:
Can fit either a line, or polynomial minimizing the sum
of mean-squared error for each datapoint and the
weighted L2 norm of the function parameters beta.
5. LASSO Regression:
Can fit either a line or polynomial, minimizing the
sum of mean-squared error for each datapoint and the
weighted L1 norm of the function parameters beta.
6. Logistic Regression:
Can fit either a line, or polynomial with sigmoid
activation minimizing the binary cross-entropy loss for
each datapoint. The labels y are binary class labels.
Visual Representation:
[Figure: example fits (y vs. x) for Linear Regression, Polynomial Regression, Bayesian Linear Regression, and Logistic Regression (binary labels 0/1).]
Summary:

Method           What does it fit?                      Error function
Linear           A line in n dimensions                 Sum of squared errors
Polynomial       A polynomial of order k                Sum of squared errors
Bayesian Linear  Gaussian distribution for each point   Sum of squared errors
Ridge            Linear/polynomial                      Sum of squared errors + weighted L2 norm of β
LASSO            Linear/polynomial                      Sum of squared errors + weighted L1 norm of β
Logistic         Linear/polynomial with sigmoid         Binary cross-entropy
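A minimal sketch of several of these regression variants using scikit-learn on toy data (the data and hyperparameter values are illustrative assumptions, not the cheat sheet's example):

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso, LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.0]) + 0.1 * rng.normal(size=100)

# linear fit, L2-penalized fit, L1-penalized fit
for model in (LinearRegression(), Ridge(alpha=1.0), Lasso(alpha=0.1)):
    model.fit(X, y)
    print(type(model).__name__, model.coef_.round(2))

# logistic regression needs binary class labels
y_bin = (y > 0).astype(int)
clf = LogisticRegression().fit(X, y_bin)
print("LogisticRegression", clf.coef_.round(2))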
• L1 Regularization: Prevents the weights from getting too large (as measured by the L1 norm). The larger the weights, the more complex the model and the higher the chance of overfitting. L1 regularization introduces sparsity in the weights: it forces more weights to be exactly zero rather than reducing the average magnitude of all weights.
• Entropy: Used for models that output probabilities. Forces the predicted probability distribution towards the uniform distribution.
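A minimal sketch of adding a weighted L1 penalty to a mean-squared-error loss; the name lambda_l1 is an assumed hyperparameter, not from the cheat sheet:

import numpy as np

def l1_regularized_mse(y_true, y_pred, weights, lambda_l1=0.01):
    # data-fit term plus weighted L1 norm of the model weights
    mse = np.mean((y_true - y_pred) ** 2)
    return mse + lambda_l1 * np.sum(np.abs(weights))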
CNN Template:
Most of the commonly used hidden layers (not all) follow a
pattern
1. Layer function: Basic transforming function such as a
convolutional or fully connected layer.
a. Fully Connected: Linear function between the input and the output.
b. Convolutional Layers: These layers are applied to 2D (3D) input feature maps. The trainable weights are a 2D (3D) kernel/filter that moves across the input feature map, computing dot products with the overlapping region of the input feature map.
c. Transposed Convolutional (DeConvolutional) Layer: Usually used to increase the size of the output feature map (upsampling). The idea behind the transposed convolutional layer is to undo (not exactly) the convolutional layer.
[Figure: Fully Connected Layer — each output is a weighted sum of all inputs plus a bias, e.g. y1 = w11·x1 + w21·x2 + w31·x3 + b1. Convolutional Layer — a small kernel slides over the input feature map, producing dot products with each overlapping region.]
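A minimal PyTorch sketch of the three layer functions above; the channel counts and feature-map sizes are made-up examples:

import torch
import torch.nn as nn

fc     = nn.Linear(in_features=128, out_features=10)     # fully connected
conv   = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
deconv = nn.ConvTranspose2d(in_channels=16, out_channels=3, kernel_size=2, stride=2)

x  = torch.randn(1, 3, 32, 32)  # one 3-channel 32x32 input feature map
h  = conv(x)                    # 1 x 16 x 32 x 32
up = deconv(h)                  # 1 x 3 x 64 x 64 (upsampled)
print(h.shape, up.shape)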
VGGNet – 2014
Why: VGGNet was born out of the need to reduce the number of
parameters in the CONV layers and improve training time.
What: There are multiple variants of VGGNet (VGG16, VGG19, etc.)
How: The important point to note here is that all the conv kernels are
of size 3x3 and maxpool kernels are of size 2x2 with a stride of two.
ResNet – 2015
Why: Neural Networks are notorious for not being able to find a
simpler mapping when it exists. ResNet solves that.
What: There are multiple versions of ResNetXX architectures where
‘XX’ denotes the number of layers. The most used ones are ResNet50
and ResNet101. Since the vanishing-gradient problem was taken care of
(more about it in the How part), CNNs started to get deeper and deeper.
How: The ResNet architecture makes use of shortcut connections to solve
the vanishing-gradient problem. The basic building block of ResNet is
a residual block that is repeated throughout the network.
[Figure: ResNet residual block — the input x bypasses the weight layers through a shortcut connection and is added to their output f(x), giving f(x) + x. Also shown: an Inception-style block where 1x1, 3x3, and 5x5 convolutions of the previous layer are combined by filter concatenation.]
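A minimal PyTorch sketch of a residual block as described above; the channel counts are assumptions, and real ResNet blocks also include batch normalization:

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        fx = self.conv2(self.relu(self.conv1(x)))
        return self.relu(fx + x)  # shortcut connection: f(x) + x

block = ResidualBlock(16)
print(block(torch.randn(1, 16, 8, 8)).shape)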
2. Boosting: Trains N different weak models (usually of the same type – homogeneous) with the complete dataset in sequential order. Datapoints wrongly classified by the previous weak model are given more weight so that the next weak learner can classify them properly. In the test phase, each model is evaluated, and based on the test error of each weak model, its prediction is weighted for voting. Boosting methods decrease the bias of the prediction.
3. Stacking: Trains N different weak models (usually of different types – heterogeneous) with one of two subsets of the dataset, in parallel. Once the weak learners are trained, they are used to train a meta learner, on the other subset, that combines their predictions into the final prediction. In the test phase, each model predicts its label, and this set of labels is fed to the meta learner, which generates the final prediction.
The block diagrams, and comparison table for each of these three methods can be seen below.
Ensemble Method – Boosting (block diagram)
Step #1: Assign equal (uniform) weights to all the datapoints in the complete dataset.
Step #2a: Train a weak model (Weak Model #1) with equal weights on all the datapoints.
Step #2b: Based on the final error of the trained weak model, calculate a scalar alpha. Use alpha to increase the weights of wrongly classified points and decrease the weights of correctly classified points.
Steps #3a/#3b … #(n+1)a/#(n+1)b: Train the next weak model with the adjusted weights on all the datapoints, then repeat the alpha calculation and weight adjustment.
Step #(n+2): In the test phase, predict from each weak model and vote their predictions, weighted by the corresponding alpha, to get the final prediction.

Ensemble Method – Bagging (block diagram)
Step #1: Create N subsets (Subset #1 … #4) from the original dataset, one for each weak model.
Step #2: Train each weak model (Weak Model #1 … #4) with an independent subset, in parallel.
Step #3: In the test phase, predict from each weak model and vote their predictions to get the final prediction.

Ensemble Method – Stacking (block diagram)
Step #1: Split the input dataset into Subset #1 (for the weak learners) and Subset #2 (for the meta learner).
Step #2: Train each weak model (Weak Model #1 … #4) with the weak-learner subset.
Step #3: Train a meta learner whose inputs are the outputs of the trained weak models on the meta-learner subset; the meta learner generates the final prediction.
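A minimal scikit-learn sketch of the three ensemble methods; the choice of weak models and parameters is an illustrative assumption:

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)

boosting = AdaBoostClassifier(n_estimators=4)                           # sequential, reweighted
bagging  = BaggingClassifier(DecisionTreeClassifier(), n_estimators=4)  # parallel subsets
stacking = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier()), ("lr", LogisticRegression())],
    final_estimator=LogisticRegression())                               # meta learner

for model in (boosting, bagging, stacking):
    print(type(model).__name__, model.fit(X, y).score(X, y))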
4. Queue
• A queue is a sequential data structure that maintains the order of elements as they were inserted.
• First In First Out (FIFO): the element inserted first will be the first one to get removed from the queue.
• Whenever an element is added (Enqueue()) it is added to the end of the queue. Element removal (Dequeue()), on the other hand, is done from the front of the queue.
• A real-life example is a check-out line at a grocery store.
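A minimal sketch of Enqueue()/Dequeue() using Python's collections.deque (the element values are made-up examples):

from collections import deque

queue = deque()
for item in (12, 0, 2):  # Enqueue(): add to the end of the queue
    queue.append(item)
print(queue.popleft())   # Dequeue(): remove from the front -> 12 (FIFO)
print(queue)             # deque([0, 2])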
5. HashTable
• Creates paired assignments (keys mapped to values) so the pairs can be accessed in constant time.
• For each (key, value) pair, the key is passed through a hash function f_hash to create a unique physical address for the value to be stored in the memory.
• A hash function can end up generating the same physical address for different keys. This is called a collision.
[Figure: keys key0–key2 pass through the hash function to addresses 0x00, 0x02, 0x04 where val0–val2 are stored.]
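A minimal sketch of key/value access using Python's built-in hash table, the dict:

table = {"key0": "val0", "key1": "val1", "key2": "val2"}
print(table["key1"])  # average O(1) lookup through the hash of the key -> "val1"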
6. Tree
• Maintains a hierarchical relation between its elements.
• Root Node − the node at the top of the tree.
• Parent Node − any node that has at least one child.
• Child Node − the successor of a parent node. A node can be both a parent and a child node. The root is never a child node.
• Leaf Node − a node which does not have any child node.
• Traversing − passing through the nodes in a certain order, e.g. BFS, DFS.
[Figure: a tree of depth 3 with root 12, its children, leaf nodes, and a sub-tree highlighted.]
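A minimal sketch of a binary tree and a depth-first traversal (node values echo the figure above):

class Node:
    def __init__(self, value, left=None, right=None):
        self.value, self.left, self.right = value, left, right

def dfs(node):
    # depth-first traversal: visit node, then its left and right sub-trees
    if node is None:
        return
    print(node.value)
    dfs(node.left)
    dfs(node.right)

root = Node(12, Node(2, Node(6), Node(18)), Node(0, Node(3), Node(8)))
dfs(root)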
7. Graph
• A graph is a pair of sets (V, E), where V is the set of all vertices and E is the set of all edges.
• The neighbors of a node are the set of all vertices connected to that node through an edge.
• As opposed to trees, a graph can be cyclic: starting from a node and following the edges, you can end up at the same node.
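A minimal sketch of a cyclic graph as an adjacency list with a breadth-first traversal (node names are made-up examples):

from collections import deque

graph = {0: [2, 6], 2: [0, 6], 6: [0, 2]}  # cyclic: 0-2-6-0

def bfs(start):
    # breadth-first traversal; the visited set prevents looping on cycles
    visited, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        print(node)
        for neighbor in graph[node]:
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(neighbor)

bfs(0)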
[Fig. 1 – Preparation Timeline for Coding Interviews: roughly 1 month (intermediate) to 3 months (beginner).]
• Review data structures and complexities:
The following 7 data structures, and their time/space complexities, are necessary for the interview:
• List/Arrays, Linked List, Hash Table/dictionary, Tree, Graph, Heap, Queue
• Practice coding questions:
• Multiple online resources such as LeetCode.com, InterviewBit.com, HackerRank.com etc.
• Pick one online resource and aim for easy and medium coding questions (approx. 100-150).
• Beginners start preparing 2-3 months before the interview, and intermediates about 1 month.
• Note:
• From my personal experience, the paid subscription of LeetCode.com was worth it.
• The Facebook, Uber, Google, and Microsoft tagged questions on LeetCode covered almost 90% of the questions asked.
Part 2 – How to answer a coding question?
• Listen to the question
The interviewer will explain the question with an example. Note down the important points.
• Talk about your understanding of the question
Repeat the question and confirm your understanding. Ask clarifying questions such as
1. Input/Output data type limitations
2. Input size/length limitations
3. Special/Corner cases
• Discuss your approach
Walk through how you would approach the problem and ask the interviewer if they agree with it.
Talk about the data structure you prefer and why. Discuss the solution with the bigger picture in mind.
• Start coding
Ask the interviewer if you could start coding. Define useful functions and explain as you write.
Think out loud so the interviewer can evaluate your thought process.
• Discuss the time and space complexity
Discuss the time and space complexity in terms of Big O for your coded approach.
• Optimize the approach
If your approach is not the most optimized one, the interviewer will hint at a few
improvements. Pay attention to the hints and try to optimize your code.
[Flow: Listen & Repeat → Ask Clarifying Questions → Walk through your approach → Start Coding → Discuss Time & Space complexity → Optimize]
Source: https://www.cheatsheets.aqeel-anwar.com
How to prepare for a behavioral interview? (1/4)
Collect stories, assign keywords, practice the STAR format.
Keywords: List important keywords that will be populated with your personal stories. The most common keywords are given below:

Negotiation · Conflict Resolution · Creativity · Flexibility · Convincing · Compromise to achieve a goal · Handling Crisis · Challenging Situation · Working with difficult people · Another team's priorities not aligned · Adjusting to a colleague's style · Taking a Stand · Handling negative feedback · Coworker's view of you · Working with a deadline · Your strength · Your weakness · Influencing Others · Handling failure · Handling unexpected situations · Converting challenge to opportunity · Decision without enough data · Mentorship/Leadership
Stories
1. List all the organizations you have been a part of. For example
1. Academia: BSc, MSc, PhD
2. Industry: Jobs, Internship
3. Societies: Cultural, Technical, Sports
2. Think of stories from step 1 that can fall into one of the keyword categories. The
more stories the better. You should have at least 10-15 stories.
3. Create a summary table by assigning multiple keywords to each story. This will help
you filter the stories when a question is asked in the interview. An example can be
seen below.
Story 1: [Convincing] [Take Stand] [influence other]
Story 2: [Mentorship] [Leadership]
Story 3: [Conflict resolution] [Negotiation]
Story 4: [decision-without-enough-data]
STAR Format
Write down the stories in the STAR format, as explained in part 2/4 of this cheat
sheet. This will help you practice organizing your stories in a meaningful way.
Example: “Tell us about a time when you had to convince senior executives”
S – Situation
Explain the situation and provide the necessary context for your story.
"I worked as an intern in XYZ company in the summer of 2019. The project details provided to me were elaborate. After some initial brainstorming and research, I realized that the project approach could be modified to make it more efficient in terms of the underlying KPIs. I decided to talk to my manager about it."

T – Task
Explain the task and your responsibility in the situation.
"[…] approach and how it could improve the KPIs. I was able to convince him. He asked me if I would be able to present my proposed approach for approval in front of the higher executives. I agreed to it. I was working out of the ABC (city) office and the executives needed to fly in from the XYZ (city) office."

A – Action
Walk through the steps and actions you took to address the issue.
"[…] executives to know better about their areas of expertise so that I could convince them accordingly. I prepared an elaborate 15-slide presentation starting with explaining their approach, moving on to my proposed approach, and finally comparing them on preliminary results."

R – Result
State the outcome of the result of your actions.
"[…] was better than the initial one. The executives proposed a few small changes to my approach and really appreciated my stand. At the end of my internship, I was selected among the 3 out of 68 interns who got to meet the senior vice president of the company over lunch."
2. Based on all the organizations you have been a part of, think of all the stories that fall under the keywords above.
3. Practice each story using the STAR format. You will have to answer the questions following this format.