Pattern Revision
(Introduction)
• Deep Learning
– Multi-layered neural networks
– End-to-end solution (works on raw data)
– Requires big data and high computational power
When to use Deep Learning?
1- Big amount of data (expensive)
2- Availability of high computational power (expensive)
3- Lack of domain understanding
4- Complex problems
Note: in deep learning, performance keeps improving as the amount of data grows, whereas older learning
techniques plateau beyond a certain amount of data.
Steps:
1- Raw Data
2- Pre-processing
3- Feature Extraction
4- Training a classifier
Unsupervised Learning
• Learn generative model of the input data that tries to describe possible patterns and
features of these inputs without the need for labeled outputs.
• Types:
– Clustering Problems
Given unlabeled data, group data points of alike features together
– Association Problems
Find dependencies within data, generate dependency rules for better
decisions
• Modeling p(x)
Steps:
1- Build supervised learning model based on labeled data
2- Expectation step: Label unlabeled data using the model built in 1
3- Maximization step: Retrain the model based on all the data
4- Repeat starting from step 2 until convergence
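The loop above can be sketched in a few lines of Python. This is only a minimal illustration, assuming a scikit-learn-style classifier with fit/predict; LogisticRegression and the function name are placeholders, not from the lecture:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression  # any probabilistic classifier works

def em_self_training(X_labeled, y_labeled, X_unlabeled, n_iters=20):
    """EM-style loop: train on labeled data, label the unlabeled data,
    retrain on everything, and repeat until the pseudo-labels stop changing."""
    model = LogisticRegression(max_iter=1000)
    model.fit(X_labeled, y_labeled)                      # step 1: supervised model on labeled data
    prev_labels = None
    for _ in range(n_iters):
        pseudo = model.predict(X_unlabeled)              # step 2: Expectation step (label unlabeled data)
        X_all = np.vstack([X_labeled, X_unlabeled])
        y_all = np.concatenate([y_labeled, pseudo])
        model.fit(X_all, y_all)                          # step 3: Maximization step (retrain on all data)
        if prev_labels is not None and np.array_equal(pseudo, prev_labels):
            break                                        # step 4: stop at convergence
        prev_labels = pseudo
    return model
```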
Reinforcement Learning
• Training Info = evaluations(rewards/penalties)
• Objective of the agent is to get as much reward as possible
What is a pattern?
• A pattern is an abstract object, such as a set of measurements describing a physical
object
Classification VS Regression?
• Both problems try to learn an unknown function that maps inputs to outputs (function
approximation)
• Classification → Predict a label
• Regression → Predict a quantity
After collecting data and choosing the features which can effectively differentiate between the
classes, a density function is estimated.
Lecture 2
(Training Patterns)
Feature Vector
X(m) = [X_1(m), X_2(m), ⋯, X_N(m)]^T
N → Number of features
m → Represents the mth training pattern
M → Number of training patterns
Decision Regions
• The data points of each class occur in groupings or clusters in the feature space plot
W0 + W1 X1 + … + WN XN = 0
• W_0 and W_i are the constants that determine the position of the hyperplane
Types of Problems
A problem is said to be linearly separable if there is a hyperplane that can separate the
training data points of class C1 from those of class C2
Lecture 3
(Pattern Classification Methods)
Minimum Distance Classifier
Steps:
• Choose a center or representative pattern V(k) for each class
• Given a pattern X that we would like to classify
• Compute the Euclidean distance from X to each center V(k) and assign X to the class of the nearest center
V(i) = (1/M_i) Σ_{m=1}^{M_i} X(m)
Disadvantages:
• Too simple to solve difficult problems
• An outlier badly affects the position of the class mean
• Poor performance if there is overlap between classes
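A minimal NumPy sketch of the minimum distance classifier steps above; the function names are illustrative, not from the lecture:

```python
import numpy as np

def fit_class_means(X, y):
    """V(i) = (1/M_i) * sum of the training patterns of class i (one mean per class)."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def min_distance_classify(x, centers):
    """Assign x to the class whose mean (center) is closest in Euclidean distance."""
    return min(centers, key=lambda c: np.linalg.norm(x - centers[c]))
```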
Nearest Neighbor Classifier (NN)
Steps:
• Compute the distance between pattern X and each pattern X(m) in the training set
• The class of the pattern m that corresponds to the minimum distance is chosen as the
classification of X
Advantages:
• simplicity
Disadvantages:
• Sensitive to outliers
• Patterns with large overlaps between the class can negatively affect performance
K-Nearest Neighbor Classifier (KNN)
** (if number of classes are odd → K =??)
Same as NN, but taking the k nearest points into consideration
Less dependent on strange patterns (outliers) compared to the nearest-neighbour classification rule
Disadvantages:
• The neighbours could be a bit far away from X leading to using information that might
not be relevant
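A small NumPy sketch of the KNN rule described above, assuming numeric feature vectors and Euclidean distance; names are illustrative:

```python
import numpy as np
from collections import Counter

def knn_classify(x, X_train, y_train, k=3):
    """Classify x by majority vote among its k nearest training patterns."""
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distance to every training pattern
    nearest = np.argsort(dists)[:k]               # indices of the k closest patterns
    return Counter(y_train[nearest]).most_common(1)[0][0]
```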
Lecture 4
(Bayes Classification Rule)
→ It assigns each data point to its most likely class, which is what makes it an optimal
classification rule
→ However, the Bayes classifier assumes that the probability densities are known, which is not
usually the case.
→ Note: The a priori probabilities represent the frequencies of the classes irrespective of the
observed features (we calculate them from the training data labels)
→ The Bayes classifier is a linear classifier iff the covariance matrices of all classes are the same.
Steps:
Given a pattern X (with unknown class) that we wish to classify:
• Compute P(C1|X), P(C2|X), …, P(Ck|X) (this is what we wish to calculate, but we can't do it
directly, so we use Bayes' rule)
• Find the k giving the maximum P(Ck|X)
Rule:
P(C_i|X) = P(C_i, X) / P(X) = P(X|C_i) P(C_i) / P(X)    (1)
PCorrect:
PError:
P(error) = 1 - P(correct)
μ_i = E(X) = ∫ X · (1/(√(2π) σ)) e^{-(X - μ_i)² / (2σ²)} dX
Variance = E[(X - μ)²] = σ²
P(X|C_i) = (1/(√(2π) σ)) e^{-(x - μ_i)² / (2σ²)}
Multi-dimensions
Independent Case
P(X|C_i) = (1 / ((2π)^{N/2} σ_1 σ_2 ⋯ σ_N)) · e^{-(1/2) Σ_{j=1}^{N} (X_j - μ_j)² / σ_j²}
Dependent Case
P(X|C_j) = (1 / ((2π)^{N/2} det^{1/2}(Σ))) · e^{-(1/2) (X - μ)^T Σ^{-1} (X - μ)}
In the independent case, all covariance-matrix entries are zero except those on the main diagonal
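Bayes' rule with Gaussian class conditionals (dependent case) can be sketched as follows; it simply estimates the prior, mean and covariance of each class from labeled data, and the helper names are illustrative:

```python
import numpy as np

def fit_gaussian_bayes(X, y):
    """Estimate the prior, mean vector and covariance matrix of each class."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (len(Xc) / len(X),          # a priori probability P(C_i)
                     Xc.mean(axis=0),           # mean vector mu_i
                     np.cov(Xc, rowvar=False))  # covariance matrix Sigma_i
    return params

def log_gaussian(x, mu, cov):
    """Log of the multivariate normal density (dependent-features case)."""
    d = x - mu
    n = len(x)
    return -0.5 * (n * np.log(2 * np.pi) + np.log(np.linalg.det(cov))
                   + d @ np.linalg.solve(cov, d))

def bayes_classify(x, params):
    """Pick the class maximizing P(X|C_i) P(C_i), i.e. the posterior up to the constant P(X)."""
    return max(params, key=lambda c: np.log(params[c][0]) + log_gaussian(x, *params[c][1:]))
```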
Lecture 5
Density Estimation
Probability densities have to be estimated in order to apply Bayes' rule
Histogram Analysis
p(x) = m / (M · sizeOfBin), where m is the number of points falling in the bin
Naive Estimator
--more accurate than histogram analysis--
P(X) = (#points falling in [X - h/2, X + h/2]) / (M h)
• Discontinuity of the density estimates
• All data points are weighted equally regardless of their distance to the estimation point
φ_h(x) = (1/(√(2π) h)) e^{-x² / (2h²)}
P(X|C_i) = (1/M) Σ_{m=1}^{M} φ_h(X - X(m))
• 𝜙 h does not have to be gaussian
φ_h(x) = (1/h) g(x/h)
• g(·) should integrate to 1 and be a known distribution
• Check Parzen window estimator as a bump function
• Note:
– The naïve estimator is equivalent to a Parzen window estimator with a rectangular (uniform) window: g(x) = 1 for |x| ≤ 1/2 and 0 otherwise
• Multi-dimension Form:
How to choose h
H_i = σ_i (4 / ((N + 2) M))^{1/(N + 4)},  where σ_i² = [Σ_x]_{i,i}
Σ_x = (1/M) Σ_{m=1}^{M} (X(m) - μ)(X(m) - μ)^T
h_opt = (1/N) Σ_{i=1}^{N} H_i
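A sketch of the Parzen window estimator with a Gaussian bump function and the bandwidth rule above. It assumes σ_i is the standard deviation of feature i (the square root of the diagonal entry of Σ_x), and, since the multi-dimensional form is not written out in the notes, it uses an isotropic Gaussian kernel with the single averaged bandwidth h_opt; names are illustrative:

```python
import numpy as np

def bandwidth(X):
    """H_i = sigma_i * (4 / ((N + 2) M))^(1/(N + 4)), averaged into one h_opt."""
    M, N = X.shape
    sigma = np.sqrt(np.diag(np.cov(X, rowvar=False)))          # per-feature std deviations
    H = sigma * (4.0 / ((N + 2) * M)) ** (1.0 / (N + 4))
    return H.mean()                                            # h_opt = mean of the H_i

def parzen_density(x, X_train, h):
    """p(x) = (1/M) * sum_m phi_h(x - X(m)) with a Gaussian bump function."""
    M, N = X_train.shape
    diff = X_train - x
    norm = (2 * np.pi) ** (N / 2) * h ** N                     # Gaussian normalization in N dims
    return np.mean(np.exp(-0.5 * np.sum(diff ** 2, axis=1) / h ** 2) / norm)
```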
Lecture 6
Feature Selection & Extraction
• If we use too many features, we will suffer from the curse of dimensionality
• Generalization ability: the ability of the classifier to generalize well to data it has
not seen before
Curse Of Dimensionality:
• The parameters’ estimates (mean vector and covariance matrix) will not be very
accurate
• The data points will become scattered and clustering of classes will not be clear
• Wrapper Type: select features by taking into consideration the classifier you will
use (SFS & SBS)
– Advantages → Accuracy
– Disadvantages → Slow Execution & Lack of Generality
Feature Extraction
• Transform the available N features into a smaller no. of L features through certain
transformations (usually linear)
• Transformed features may have no physical meaning → explaining the model is
problematic
• May not be suitable to every domain
Z = [u_1^T; u_2^T; ⋮; u_L^T] Y,  i.e.  Z = A Y
Cov(Z) = A Σ A^T = U^T Σ U = Ω
Σ U = U Ω,  so  U^T Σ U = Ω
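The transformation above (essentially PCA) can be sketched with NumPy's eigendecomposition: keep the L eigenvectors of the covariance matrix with the largest eigenvalues as the rows of A. The function name is illustrative:

```python
import numpy as np

def pca_transform(Y, L):
    """Project N-dimensional data Y (rows are samples) onto the top-L eigenvectors:
    Z = A Y with rows u_1^T ... u_L^T."""
    Yc = Y - Y.mean(axis=0)                       # center the data
    cov = np.cov(Yc, rowvar=False)                # Sigma
    eigvals, eigvecs = np.linalg.eigh(cov)        # Sigma U = U Omega
    order = np.argsort(eigvals)[::-1][:L]         # L largest eigenvalues
    A = eigvecs[:, order].T                       # rows are u_1^T ... u_L^T
    return Yc @ A.T                               # Z = A Y for every sample
```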
Lecture 7
Classifiers Combinations
Why classifiers combination?
The best classifier on the training set might not be the best on the test set, so it is risky to choose
the best one on the training set. As a result, it is better to combine classifiers with high diversity.
Ways to combine
• Majority Vote
• Average the class posterior probabilities
• Average some sort of score function
• Can also choose median instead of average
Adaboost
• Iterative procedure
• Tries to approximate the Bayes classifier by combining many weak classifiers to
create a strong classifier
• AdaBoost works well only in case of binary classification
• AdaBoost assumes that the error of each weak classifier is less than 0.5
– α_t is negative if the error is greater than 0.5 (the log of a number smaller than 1 is negative)
– The error of random guessing in the case of two classes is 0.5
– In the case of K classes, the random-guessing error rate is (k - 1)/k
α_t = log((1 - err) / err) + log(k - 1)
∴ α_t is positive only if (1 - err) > 1/k
Weak Classifier: is able to guess the right class with an accuracy slightly better than
random guessing
1- Initialize the weights: w_m = 1/M
2- For t = 1 to T:
• Select a classifier h_t that best fits the training data using the weights w_m of the training
examples
• Compute the error of h_t as: err_t = Σ_{m=1}^{M} w_m 1(c_m ≠ h_t(x_m)) / Σ_{m=1}^{M} w_m
• Compute the weight of the classifier: α_t = log((1 - err_t) / err_t)
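A sketch of the loop above for the binary case with labels in {-1, +1}, using depth-1 decision trees from scikit-learn as weak classifiers. The notes stop at α_t, so the weight-update and final-vote lines below follow the standard AdaBoost formulas rather than anything stated here, and the function names are illustrative:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier  # depth-1 trees ("stumps") as weak classifiers

def adaboost_fit(X, y, T=50):
    """AdaBoost for binary labels y in {-1, +1}."""
    M = len(y)
    w = np.full(M, 1.0 / M)                              # 1) initialize weights w_m = 1/M
    stumps, alphas = [], []
    for _ in range(T):                                   # 2) for t = 1..T
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = h.predict(X)
        err = np.clip(np.sum(w * (pred != y)) / np.sum(w), 1e-10, None)  # weighted error
        if err >= 0.5:                                   # weak-learner assumption violated
            break
        alpha = np.log((1 - err) / err)                  # classifier weight alpha_t
        w *= np.exp(alpha * (pred != y))                 # standard reweighting of the mistakes
        stumps.append(h)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(X, stumps, alphas):
    """Weighted vote of the weak classifiers (standard final combination)."""
    votes = sum(a * h.predict(X) for h, a in zip(stumps, alphas))
    return np.sign(votes)
```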
Overfitting:
• AdaBoost is robust to overfitting, given that we select the best weak classifiers
• However, relying on complex (strong) classifiers makes it more prone to overfitting
Lecture 8
GMM
Why GMM?
• Assume we have a small data set, so it is not possible to estimate the class conditionals using a
kernel density estimator. Instead, we model each class conditional as a weighted sum of multivariate
Gaussian densities.
P(X) = Σ_{j=1}^{K} w_j · (1 / ((2π)^{N/2} det^{1/2}(Σ_j))) e^{-(1/2) (X - μ_j)^T Σ_j^{-1} (X - μ_j)}
     = Σ_{j=1}^{K} w_j N(X; μ_j, Σ_j)
Σ_{j=1}^{K} w_j = 1
Issues with GMM
1. Initialization
• Expectation–Maximization (EM) is an iterative algorithm that is very sensitive to initial
conditions (start from trash → end up with trash)
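A sketch of using GMMs as class conditionals with scikit-learn's GaussianMixture; the n_init restarts are one simple way to reduce the sensitivity to initialization mentioned above, and the function names are illustrative:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_gmm_class_conditionals(X, y, K=3, n_restarts=5):
    """Fit one K-component GMM per class as the class conditional P(X|C_i);
    n_init restarts mitigate EM's sensitivity to initialization."""
    priors = {c: np.mean(y == c) for c in np.unique(y)}
    gmms = {c: GaussianMixture(n_components=K, n_init=n_restarts).fit(X[y == c])
            for c in np.unique(y)}
    return gmms, priors

def gmm_bayes_classify(x, gmms, priors):
    """Combine log P(X|C_i) from each class GMM with the class prior, Bayes-rule style."""
    x = np.atleast_2d(x)
    return max(gmms, key=lambda c: gmms[c].score_samples(x)[0] + np.log(priors[c]))
```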
Lecture 9
Decision Trees & Random Forests
Building a Decision Tree
Entropy
H(S) = - pyes lg(pyes ) - pno lg(pno )
Information Gain
Gain(S, X) = H(S) - Σ_{v ∈ values(X)} (|S_v| / |S|) H(S_v)
Overfitting
If we split the decision tree until all training examples are correctly classified, all leaf nodes
will be pure, even if they contain just one example (singletons). This causes overfitting, and
the model can't generalize to new data.
To avoid overfitting:
Way 1:
Stop splitting when not statistically significant
Gain Ratio
SplitEntropy(S, X) = - Σ_{v ∈ values(X)} (|S_v| / |S|) lg(|S_v| / |S|)
GainRatio(S, X) = Gain(S, X) / SplitEntropy(S, X)
Continuous Attributes
• Continuous attributes can be repeated, unlike discrete attributes
• Real values of attributes are sorted and average of each two adjacent examples is a
threshold to be considered
Entropy in multi-class classification
H(S) = - Σ_i p_i lg(p_i)
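Entropy and information gain as defined above, sketched in NumPy for discrete attributes; the function names are illustrative:

```python
import numpy as np

def entropy(labels):
    """H(S) = -sum_i p_i * log2(p_i); works for two or more classes."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(labels, attribute_values):
    """Gain(S, X) = H(S) - sum_v (|S_v| / |S|) * H(S_v)."""
    gain = entropy(labels)
    for v in np.unique(attribute_values):
        subset = labels[attribute_values == v]
        gain -= (len(subset) / len(labels)) * entropy(subset)
    return gain
```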
Regression:
• Predicted output → avg. of training examples in the subset (or linear regression at
leaves)
• Minimize the variance in the subsets (instead of maximizing the gain)
Replacement means that a selected item can be reselected multiple times for the same
subset
Random Forest
Steps:
• M is the number of training examples
• Uniformly sample T subsets (each of size M) with replacement
• Build T decision trees with zero training error
• Take the average/votes of T trees
Building Tree:
• D is the number of features
• Sample K features randomly (K < D)
• Only split on these K random features
• New K features are sampled for every single split
Notes:
- This means that different trees are built, so each tree makes different mistakes
- No need to tune hyperparameters
- No need to pre-process or scale inputs
- Increase T as much as you can afford (Parallel Processing)
- No need for training/validation split
- Can estimate test error directly from training set
- Second best approach
- Not suitable for raw images
- Improvement: Prune the last split of trees to decrease the size of trees and decrease noise
- Compute the error for each training example (consider only the trees that did not include that
example in their training subset; each bootstrap subset covers roughly 60% of the distinct examples)
K = ⌈√D⌉
Error Calculations:
E_OOB = (1/M) Σ_{i=1}^{M} (1/Z_i) Σ_{j : (x_i, y_i) ∉ S_j} loss(h_j(x_i), y_i)    (2)
Z_i = Σ_{j : (x_i, y_i) ∉ S_j} 1
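A sketch of the same idea with scikit-learn's RandomForestClassifier: bootstrap sampling, roughly √D features per split, and the out-of-bag score as a test-error estimate computed directly from the training set. The function name is illustrative:

```python
from sklearn.ensemble import RandomForestClassifier

def oob_error(X, y, T=200):
    """Estimate the test error from the training set via out-of-bag scoring:
    each example is evaluated only by the trees whose bootstrap subset excluded it."""
    forest = RandomForestClassifier(
        n_estimators=T,            # increase T as much as you can afford
        max_features="sqrt",       # sample roughly K = sqrt(D) features at every split
        bootstrap=True,            # sample T subsets of size M with replacement
        oob_score=True,
    ).fit(X, y)
    return 1.0 - forest.oob_score_
```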
Neural Networks
• Weights are the parameters that encode the information in the brain
• Brain is superior because of the massive parallelism
• Weights act like the “storage” in computers
• When a human encounters a new experience, the weights of his brain get adjusted
• Memories are encoded in the weights
• The weights determine the functionality of the model
• The neuron can implement a linear classifier
Augmented Vectors:
W = [W_0, w^T]^T,  where w = [W_1, W_2, ⋯, W_N]^T
u(m) = [1, X(m)^T]^T = [1, X_1(m), ⋯, X_N(m)]^T
y(m) = f(W^T u(m))
• It can be used for non linearly separable problems as long as it produces a low
classification error rate
1- Initialize the weights and threshold (bias) randomly
2- Present the augmented input (or feature) vector of the m-th training pattern u(m) and its
corresponding desired output d(m)
3- Calculate the actual output for pattern m → y(m) = f(W^T u(m))
4- Adapt the weights according to the following rule (called the Widrow-Hoff rule):
W(new) = W(old) + η [d(m) - y(m)] u(m)
η → Learning rate
5- Go to step 2 until all patterns are classified correctly, i.e.,d(m)=y(m) for m=1, … , M
• If the problem is not linearly separable, the algorithm will not converge and will keep
cycling forever; that is why we need the least-squares classifier (Rosenblatt's theorem)
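A minimal sketch of the perceptron training loop described above, with augmented vectors, a hard-threshold activation and the Widrow-Hoff style update; it only terminates early if the data are linearly separable, and the function name is illustrative:

```python
import numpy as np

def perceptron_train(X, d, eta=0.1, max_epochs=100):
    """Train on patterns X (M x N) with desired outputs d in {0, 1}."""
    M, N = X.shape
    W = np.random.uniform(-1, 1, N + 1)            # 1) random weights and bias in [-1, 1]
    U = np.hstack([np.ones((M, 1)), X])            # augmented input vectors u(m) = [1, X(m)]
    for _ in range(max_epochs):
        errors = 0
        for m in range(M):                          # 2-4) present each pattern and adapt
            y = 1.0 if W @ U[m] >= 0 else 0.0       # hard-threshold activation f(.)
            if y != d[m]:
                W += eta * (d[m] - y) * U[m]        # W(new) = W(old) + eta [d(m) - y(m)] u(m)
                errors += 1
        if errors == 0:                             # 5) stop when all patterns are correct
            break
    return W
```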
E = Σ_{m=1}^{M} (W^T u(m) - b_m)²
– ∂E/∂W = [∂E/∂W_0, ∂E/∂W_1, ⋯, ∂E/∂W_N]^T
– Set ∂E/∂W = 0 and solve for W
– Advantages:
* Can converge even if the problem is not linearly separable
– Disadvantages:
* Linear classifiers don't solve all problems
Multi-Layer Network
• The powerful feature of multilayer networks (aka. feed forward networks) is its ability to
learn
• We usually use hidden node fn’s (Activation) that are continuous
Gradient Descent
• We use the concept of steepest descent
• To update the weights
W(new) = W(old) - α ∂E/∂W
Steps:
1- Initialize the weights and threshold (bias) randomly [-1,1]
2- Present the augmented input (or feature) vector of the m-th training pattern u(m) and its
corresponding desired output d(m)
3- For m=1 to M:
• Present u(m) to the network and compute the hidden layer outputs and final layer
outputs
• Use these outputs in a backward scheme to compute the partial derivatives of error fn.
w.r.t. to the weights of each layer
• Update the weights → W_{i,j}^{[L]}(new) = W_{i,j}^{[L]}(old) - α ∂E_m / ∂W_{i,j}^{[L]}
• Gradient descent with momentum
DW = (β DW + (1 - β) ∂E/∂W) / (1 - β)
W = W - α DW
DW = 0 initially
• RMSProp
Sw = (β Sw + (1 - β) (∂E/∂W)²) / (1 - β)
W = W - α (∂E/∂W) / (√Sw + ε)
• Adam
Sw = (β Sw + (1 - β) (∂E/∂W)²) / (1 - β)
Dw = (β Dw + (1 - β) ∂E/∂W) / (1 - β)
W = W - α Dw / (√Sw + ε)
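The three update rules can be sketched as single-step functions. They follow the notes' simplified form (dividing by (1 - β) every step); standard Adam instead uses two separate βs and bias-corrects by (1 - β^t), so treat this as an illustration of the structure, not a reference implementation:

```python
import numpy as np

def momentum_step(W, grad, DW, alpha=0.01, beta=0.9):
    """Momentum: smooth the gradient with a moving average, then step."""
    DW = (beta * DW + (1 - beta) * grad) / (1 - beta)
    return W - alpha * DW, DW

def rmsprop_step(W, grad, Sw, alpha=0.01, beta=0.9, eps=1e-8):
    """RMSProp: scale the step by the running average of squared gradients."""
    Sw = (beta * Sw + (1 - beta) * grad ** 2) / (1 - beta)
    return W - alpha * grad / (np.sqrt(Sw) + eps), Sw

def adam_step(W, grad, DW, Sw, alpha=0.01, beta=0.9, eps=1e-8):
    """Adam: momentum numerator over the RMSProp denominator."""
    DW = (beta * DW + (1 - beta) * grad) / (1 - beta)
    Sw = (beta * Sw + (1 - beta) * grad ** 2) / (1 - beta)
    return W - alpha * DW / (np.sqrt(Sw) + eps), DW, Sw
```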
Regularization
• Used to prevent overfitting
• Intuition: set the weights of some hidden nodes to zero to simplify the network
• L2 Regularization → (1/M) Σ_{m=1}^{M} E_m + (λ/(2M)) ||W||₂²
• L1 Regularization → (1/M) Σ_{m=1}^{M} E_m + (λ/(2M)) ||W||₁
• L 2 regularization is used more often
• 𝜆 is the regularization parameter (hyper parameter)
Dropout Regularization
• Every epoch → shut down a random subset of neurons
• Not all nodes get trained every epoch
• No neuron gets overfitted
• Simpler Model
• More generalized
• Adam Parameters
Tuning Process
• Try random values: don’t use a grid
• Coarse to fine scheme (Focus on the good regions)
• Use appropriate scale
– Don't sample uniformly
– Use logarithmic scale
a(year) = (1/12) Σ_{t ∈ window} x(t)
Z(i) = x(i) / a(year)
u(i) = Σ_j Z_j(i) / #years
x_deseasonal(t) = x(t) / u(month(t))
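A sketch of the deseasonalization formulas above, assuming the series is arranged as a (#years × 12) array of monthly values (the layout is an assumption, not stated in the notes):

```python
import numpy as np

def deseasonalize(x):
    """x: array of shape (#years, 12), one value per month.
    Divide each value by its year's average, average those ratios per month
    across years to get the seasonal index u(i), then divide the raw series by it."""
    yearly_avg = x.mean(axis=1, keepdims=True)   # a(year) = (1/12) * sum over the year
    Z = x / yearly_avg                           # Z(i) = x(i) / a(year)
    u = Z.mean(axis=0)                           # u(i) = sum_j Z_j(i) / #years
    return x / u                                 # x_deseasonal(t) = x(t) / u(month(t))
```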
What would have changed in the previous question if the neural network was solving a
regression problem rather than a classification problem?
L = (1/M) Σ_{i=1}^{M} (a_i^{[2]} - y_i)²    (MSE)
In case of a regression problem →
L = √((1/M) Σ_{i=1}^{M} (y_i - ŷ_i)²)    (RMSE)
What would have changed in the previous question if the neural network was solving a
multi-class classification problem rather than a binary classification problem? Write
down the modified equations.
y_i = e^{z_i^{[L]}} / Σ_j e^{z_j^{[L]}}
What does the learning rate (alpha parameter) mean? How does changing the learning
rate
affect the training process?
• It controls how large a step is taken in the direction of the negative gradient each time the
weights are updated
• Too small alpha → Convergence will be slow
• Too large alpha → Convergence will oscillate around the minimum
• Good range is between 0.001 and 0.05
W(new) = W(old) - α ∂E/∂W
What is the difference between batch gradient descent and sequential (stochastic)
gradient
descent?
Stochastic (sequential) gradient descent:
Disadvantages:
• Hard to converge since it depends on every single example
• Loses the speedup from vectorization
Batch gradient descent:
Advantages:
• Optimization is more consistent
Disadvantages:
• Slow (too long per iteration)
Mini-batch gradient descent:
Advantages:
• Fast
Mention four different types of activation functions. Write down the mathematical
expression for each of them as well as their derivatives. Mention the advantage(s) and
disadvantage(s) of each of them.
Sigmoid → σ(x) = 1 / (1 + e^{-x}),  σ'(x) = σ(x)(1 - σ(x))
• Slow learning
• Used only in the output layer in binary classification problems
• Unfortunately it has ≈ 0 gradient in some parts (saturation)
Tanh → tanh(x) = (e^x - e^{-x}) / (e^x + e^{-x}),  tanh'(x) = 1 - tanh²(x)
What are the hyperparameters in the gradient descent update algorithm? How to
select these hyperparameters?
• Learning rate 𝛼
• Size of the mini batch
• Number of hidden nodes
• Number of layers
Tuning Process:
1- Try random values
2- Coarse to fine scheme
3- Use appropriate scale (Don't sample uniformly, Use logarithmic scale)
What is the difference between training set, test set and validation set? Are there any
guidelines in selecting each of them?
Training Set: Here, you have the complete training dataset. You can extract features and
train to fit a model and so on.
Validation Set: This is crucial for choosing the right parameters for your estimator. We can
divide the training set into a train set and a validation set. Based on the validation results,
the model can be tuned (for instance, by changing parameters or classifiers). This helps us get
the most optimized model.
Testing Set: Here, once the final model is obtained, you use it to predict on data it has never
seen, to estimate how well it generalizes.
Mention the difference between overfitting and underfitting? Give an example to each
of them.
Underfitting: the model is too simple and performs poorly even on the training data. Overfitting: the model fits the training data (including its noise) but fails to generalize to new data.
[Figure: examples of underfitting, a normal (good) fit, and overfitting]
What are different optimization algorithms? State the weight update equation for each
of these optimizers.
• Gradient descent
W(new) = W(old) - α ∂E/∂W
• Gradient descent with momentum
DW = (β DW + (1 - β) ∂E/∂W) / (1 - β)
W = W - α DW  (DW = 0 initially)
• RMSProp
- Slow down learning in unintended directions
- Avoid oscillations
Sw_i = (β Sw_i + (1 - β) (∂E/∂W_i)²) / (1 - β)
Sw_j = (β Sw_j + (1 - β) (∂E/∂W_j)²) / (1 - β)
W_i = W_i - α (∂E/∂W_i) / (√Sw_i + ε)
W_j = W_j - α (∂E/∂W_j) / (√Sw_j + ε)
• Adam
Dw_i = (β Dw_i + (1 - β) ∂E/∂W_i) / (1 - β)
W_i = W_i - α Dw_i / (√Sw_i + ε)
(Sw_i is computed as in RMSProp)
What is the main difference between gradient descent and gradient descent with
momentum?
With Stochastic Gradient Descent we don’t compute the exact derivate of our loss function.
Instead, we’re estimating it on a small batch. Which means we’re not always going in the
optimal direction, because our derivatives are ‘noisy’.
Gradient descent with momentum smooths out the steps of gradient descent by using a moving
average of the derivatives, so it avoids oscillations and learns faster, since the moving average
takes the previous gradients into consideration.
Mention two different ways to reduce overfitting. Explain how each of them reduces
overfitting.
Regularization
• Used to prevent overfitting
• Intuition: set the weights of some hidden nodes to zero to simplify the network
• L2 Regularization → (1/M) Σ_{m=1}^{M} E_m + (λ/(2M)) ||W||₂²
• L1 Regularization → (1/M) Σ_{m=1}^{M} E_m + (λ/(2M)) ||W||₁
• L 2 regularization is used more often
• 𝜆 is the regularization parameter (hyper parameter)
Dropout Regularization
• Every epoch → shut down a random subset of neurons
• Not all nodes get trained every epoch
• No neuron gets overfitted
• Simpler Model
• More generalized
MCQ Questions:
1. Which hyperparameter of the following needs to be tuned first in a typical neural network
problem?
a. Momentum parameter.
b. Mini batch size
c. Learning rate
d. Number of hidden nodes in each layer.
5. The neural network that tries to match two given inputs and detect how similar or different
they are from each other is called
a. Convolutional neural network.
b. Siamese network.
c. Recurrent neural network.
d. Generative Adversarial Network.
6. A network with a skip connection from output layer to input layer is called
a. Convolutional neural network.
b. Siamese network
c. Recurrent Neural Network.
d. Generative Adversarial Network.
7. A neural network used mainly to generate features from input images and represent these
images in a compressed low dimensional space is called:
a. Convolutional neural network.
b. Auto Encoder network.
c. Recurrent Neural Network.
d. Siamese Network.
8. If the neural network is subject to overfitting, then we can reduce the effect of overfitting by:
a. Increasing the size of the training data.
b. Increasing the size of the neural network.
c. L2- Regularization.
d. Dropout regularization.
10. For the neural network to learn functions such as XOR and XNOR, it is sufficient to have:
a. 1 input layer and 1 output layer.
b. 1 input layer, 1 hidden layer, 1 output layer.
c. 1 input layer, 2 hidden layers, 1 output layer.
d. It is dependent on the number of inputs, and so it is impossible to tell.
QUESTIONS
Define the Machine Learning Recipe
[Figure: underfitting vs. normal fit vs. overfitting]
Weak Classifier: is able to guess the right class with an accuracy slightly better than
random guessing
α_t = log((1 - err) / err) + log(k - 1)
∴ α_t is positive only if (1 - err) > 1/k
What’s the difference between wrapper method and filter method for feature selection?
Methods of Feature Selection:
• Filter type: select features without looking into the classifier you are going to use.
– Advantages → Faster Execution & Generality
– Disadvantages → Tendency to select large subsets
• Wrapper Type: select features by taking into consideration the classifier you will use
(SFS & SBS)
– Advantages → Accuracy
– Disadvantages → Slow Execution & Lack of Generality
• Deep Learning
– Multi-layered neural networks
– End-to-end solution (works on raw data)
– Requires big data and high computational power
If you train a neural network and get 54% training accuracy and 51% validation accuracy,
explain what you will do next
This means the neural network is underfitting, so to overcome this problem:
• Use a larger network (more nodes or more layers)
• Train for a longer time
η → Learning rate
y(new) = f(W_new^T u(m))
y(new) = f(W_old^T u(m) + η [d(m) - y(m)] u^T(m) u(m))
y(new) = f(W_old^T u(m) + η [1 - 0] u^T(m) u(m))
y(new) = f(W_old^T u(m) + η ||u(m)||²)
Explain the naive estimator method, write the formula used and compare it to
histogram analysis
Naive Estimator
P(X) = (#points falling in [X - h/2, X + h/2]) / (M h)
• Discontinuity of the density estimates
• All data points are weighted equally regardless of their distance to the estimation point
Histogram Analysis
p(x) = m / (M · sizeOfBin), where m is the number of points falling in the bin
Explain the kernel density estimation technique with multidimensional case equations.
To apply the kernel density estimation technique, we need the following:
1. Bump Function (g(x))
• We can model the bump function as follows in the multidimensional case
• If the problem is not linearly separable, the algorithm will not converge and will keep
cycling forever; that is why we need the least-squares classifier (Rosenblatt's theorem)
E = Σ_{m=1}^{M} (W^T u(m) - b_m)²
• We then seek to minimize the error function by finding the W that minimizes E
• Define the gradient vector:
– ∂E/∂W = [∂E/∂W_0, ∂E/∂W_1, ⋯, ∂E/∂W_N]^T
– Set ∂E/∂W = 0 and solve for W
– Advantages:
* Can converge even if the problem is not linearly separable
– Disadvantages:
* Linear classifiers don't solve all problems
1. Initialization
• Expectation–Maximization (EM) is an iterative algorithm that is very sensitive to initial
conditions (start from trash → end up with trash)
Regularization
• Used to prevent overfitting
• Intuition: set the weights of some hidden nodes to zero to simplify the network
• L2 Regularization → (1/M) Σ_{m=1}^{M} E_m + (λ/(2M)) ||W||₂²
• L1 Regularization → (1/M) Σ_{m=1}^{M} E_m + (λ/(2M)) ||W||₁
• L 2 regularization is used more often
• 𝜆 is the regularization parameter (hyper parameter)
Dropout Regularization
• Every epoch → shut down a random subset of neurons
• Not all nodes get trained every epoch
• No neuron gets overfitted
• Simpler Model
• More generalized
Input & Output Regularization
• Inputs have to be approximately in the range of 0 to 1 or -1 to 1
It (the Bayes classifier) is an optimal classification rule; the reason is that it chooses the most
likely class, so nothing could do better.
2- Present the augmented input (or feature) vector of the m-th training pattern u(m) and its
corresponding desired output d(m)
3- For m=1 to M:
• Present u(m) to the network and compute the hidden layer outputs and final layer
outputs
• Use these outputs in a backward scheme to compute the partial derivatives of error fn.
w.r.t. to the weights of each layer
• Update the weights → W_{i,j}^{[L]}(new) = W_{i,j}^{[L]}(old) - α ∂E_m / ∂W_{i,j}^{[L]}
Stochastic (sequential) gradient descent:
Advantages:
• Faster in update compared to batch gradient descent
Disadvantages:
• Hard to converge since it depends on every single example
• Loses the speedup from vectorization
Batch gradient descent:
Advantages:
• Optimization is more consistent
Disadvantages:
• Slow (too long per iteration)
Mini-batch gradient descent:
Advantages:
• Fast
Describe how CNNs work, and why they have a smaller memory footprint.
It is mostly applied to imagery problems
- Layers extract features from the input images
- Convolution layer, i.e., filtering
- Pooling layer, i.e., reduce the input (avg or max)
- Fully connected layer, i.e., as in a multi-layer NN, at the final layers
- The memory footprint is smaller because the convolution filters are shared across all positions of the image (weight sharing), so far fewer weights are needed than in a fully connected network
7. Combinatorial Explosion
• Real world images are combinatorial large
• Application dependent (e.g., medical imaging is an exception)
• Considering compositionality may be a potential solution
• Testing is challenging (consider worst case scenarios)
In this technique, the parameter K refers to the number of different subsets that the given
data set is split into. Further, K-1 subsets are used to train the model and the left-out
subset is used as a validation set.
7. Repeat the above step K times, i.e., until the model has been trained and tested on all subsets
8. Generate the overall prediction error by taking the average of the prediction errors in every case
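A sketch of the K-fold procedure described above; model_factory is a hypothetical callable that returns a fresh scikit-learn-style estimator for each fold:

```python
import numpy as np

def k_fold_error(model_factory, X, y, K=5):
    """Split the data into K subsets, train on K-1 and validate on the held-out one,
    repeat K times, and average the prediction errors."""
    idx = np.random.permutation(len(y))
    folds = np.array_split(idx, K)
    errors = []
    for k in range(K):
        val = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        model = model_factory().fit(X[train], y[train])
        errors.append(np.mean(model.predict(X[val]) != y[val]))  # fold prediction error
    return np.mean(errors)                                       # overall prediction error
```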