
AL3451-MACHINE LEARNING

Review of Linear Algebra for machine learning

Linear Algebra:
Linear Algebra is an essential field of mathematics that studies the
vectors, matrices, planes, mappings, and lines required for linear
transformations. It enables ML algorithms to operate on large datasets.

Benefits of learning Linear Algebra before Machine Learning:
• Better Graphic experience
• Improved Statistics
• Creating better Machine Learning algorithms
• Estimating the forecast of Machine Learning
• Easy to Learn
Definition of Scalar, Vector, Matrix, and Tensor

Scalar: a single number

Vector: a one-dimensional array of numbers
Matrix: a two-dimensional array of numbers
Tensor: a multi-dimensional array of numbers (more than two dimensions)
Implementation of Linear Regression in ML
• To implement them, we can use the NumPy array np.array() in Python (after import numpy as np).
• scalar = 1
• vector = np.array([1, 2])
• matrix = np.array([[1, 1], [2, 2]])
• tensor = np.array([[[1, 1], [2, 2]], [[3, 3], [4, 4]]])
Matrix in ML
Examples of linear algebra in Machine learning
• Datasets and Data Files
• Linear Regression
• Recommender Systems
• One-hot encoding
• Regularization
• Principal Component Analysis
• Images and Photographs
• Singular-Value Decomposition
• Deep Learning
• Latent Semantic Analysis
• Neural Network
Introduction and motivation for machine learning
Machine learning is programming computers to optimize a performance
criterion using example data or past experience.

1. In training, we need efficient algorithms to solve the optimization problem, as well as to store and process the massive amount of data we generally have.

2. Once a model is learned, its representation and algorithmic solution for inference need to be efficient as well.

In certain applications, the efficiency of the learning or inference algorithm, namely its space and time complexity, may be as important as its predictive accuracy.
Machine learning (ML) is a branch of artificial intelligence (AI) that enables
computers to “self-learn” from training data and improve over time,
without being explicitly programmed. Machine learning algorithms are able
to detect patterns in data and learn from them, in order to make their own
predictions.
Examples of machine learning applications
Types of Machine Learning

1. Unsupervised Learning
2. Supervised Learning
3. Reinforcement Learning
Vapnik-Chervonenkis (VC) dimension
The Vapnik–Chervonenkis (VC) dimension is a measure of the capacity of a statistical classification algorithm, defined as the cardinality of the largest set of points that the algorithm can shatter.

Let us say we have a dataset containing N points.
These N points can be labeled in 2^N ways as positive and negative.
Therefore, 2^N different learning problems can be defined by N data points.
If for any of these problems we can find a hypothesis h ∈ H that separates the positive examples from the negative, then we say H shatters N points.
Probably Approximately Correct (PAC) learning
PAC Continued...
Hypothesis spaces

• It is just a guess based on some known facts but has not yet been proven.
• A good hypothesis is testable, which results in either true or false.
Inductive bias
• Need to make assumptions
• Experience alone doesn’t allow us to make conclusions about unseen
data instances.
• Two types of bias:
»Restriction: Limit the hypothesis space.
»Preference: Impose ordering on hypothesis space
Inductive bias - definition
UNIT II SUPERVISED LEARNING

• Linear Regression Models: Least squares, single & multiple variables, Bayesian linear regression, gradient descent; Linear Classification Models: Discriminant function – Perceptron algorithm, Probabilistic discriminative model – Logistic regression, Probabilistic generative model – Naive Bayes, Maximum margin classifier – Support vector machine, Decision Tree, Random Forests
• 1. Assume a disease so rare that it is seen in only one
person out of every million. Assume also that we have a
test that is effective in that if a person has the disease,
there is a 99 percent chance that the test result will be
positive; however, the test is not perfect, and there is a
one in a thousand chance that the test result will be
positive on a healthy person. Assume that a new patient
arrives and the test result is positive. What is the
probability that the patient has the disease?
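A short worked solution of this exercise using Bayes' rule; a minimal Python sketch in which the probabilities are taken directly from the problem statement above.

# P(D) = 1/1,000,000, P(+|D) = 0.99, P(+|healthy) = 0.001
p_disease = 1e-6
p_pos_given_disease = 0.99
p_pos_given_healthy = 0.001

# Bayes' rule: P(D|+) = P(+|D)P(D) / [P(+|D)P(D) + P(+|healthy)P(healthy)]
numerator = p_pos_given_disease * p_disease
evidence = numerator + p_pos_given_healthy * (1 - p_disease)
print(numerator / evidence)   # about 0.00099, i.e. roughly a 0.1% chance of disease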
Linear Regression Models:
Learning:
– A supervised algorithm that learns from a set of training samples.
– Each training sample has one or more input values and a single
output value.
– The algorithm learns the line, plane or hyper-plane that best fits
the training samples.
Prediction
– Use the learned line, plane or hyper-plane to predict the output
value for any input sample.
Least squares
• It is a mathematical method used to find the best-fit line that represents the relationship between an independent and a dependent variable.

• The sum of squared distances (errors) between the line of best fit and the data points must be minimized as much as possible.
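A minimal NumPy sketch of least-squares line fitting; the x and y values below are made up purely for illustration.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Fit y = slope*x + intercept by minimizing the sum of squared errors
slope, intercept = np.polyfit(x, y, deg=1)
print(slope, intercept)        # parameters of the best-fit line
print(slope * x + intercept)   # predictions on the training inputs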
single & multiple variables:

Single Variable Linear Regression is a technique used to model the relationship between a single independent input variable (feature variable) and an output dependent variable using a linear model, i.e. a line.

Multi-Variable Linear Regression

Multi-Variable Linear Regression is a technique where a model is created for the relationship between multiple independent input variables (feature variables) and an output dependent variable.
Bayesian linear regression
• The aim of Bayesian Linear Regression is not to find
the single “best” value of the model parameters, but
rather to determine the posterior distribution for the
model parameters.
• The posterior probability of the model parameters is
conditional upon the training inputs and outputs:

P(β|y, X) = P(y|X, β) · P(β|X) / P(y|X)

P(β|y, X) -- posterior probability distribution of the parameters

P(y|X, β) -- likelihood of the data

P(β|X) -- prior probability of the parameters

P(y|X) -- normalization constant (marginal likelihood)


• Two primary benefits of Bayesian Linear Regression:

– Priors: If we have domain knowledge, or a guess for what the model parameters should be, we can include it in our model, unlike in the frequentist approach, which assumes everything there is to know about the parameters comes from the data.

– Posterior: The result of performing Bayesian Linear Regression is a distribution of possible model parameters based on the data and the prior. This allows us to quantify our uncertainty about the model: if we have fewer data points, the posterior distribution will be more spread out.
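A hedged sketch of Bayesian linear regression using scikit-learn's BayesianRidge, one common implementation choice (not prescribed by the slides); the synthetic data is illustrative only.

import numpy as np
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

model = BayesianRidge()        # places priors on the weights and noise precision
model.fit(X, y)

# return_std=True gives the spread of the posterior predictive distribution
mean, std = model.predict(X[:5], return_std=True)
print(mean, std)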
gradient descent

E.g.: Imagine a valley and a person with no sense of direction who wants to get to the bottom of the valley. He goes down the slope and takes large steps when the slope is steep and small steps when the slope is less steep. He decides his next position based on his current position and stops when he gets to the bottom of the valley, which was his goal.
• Gradient descent is an iterative optimization algorithm to find the
local minimum of a function.
• It can be used to minimize an error function in neural networks in
order to optimize the weights of the neural network.
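A minimal NumPy-only sketch of gradient descent minimizing mean squared error for a one-variable linear model y = w*x + b; the data, learning rate, and iteration count are illustrative assumptions.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])   # generated by y = 2x + 1

w, b, lr = 0.0, 0.0, 0.05
for _ in range(2000):
    error = (w * x + b) - y
    grad_w = 2 * np.mean(error * x)   # d(MSE)/dw
    grad_b = 2 * np.mean(error)       # d(MSE)/db
    w -= lr * grad_w                  # step against the gradient
    b -= lr * grad_b
print(w, b)   # approaches 2 and 1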
Linear Classification Models:
• A linear classifier makes its classification decision based on the value of a linear combination of the characteristics (features). Imagine that the linear classifier merges into its weights all the characteristics that define a particular class.
The weight matrix will have one row for every class that needs to be classified, and one column for every element (feature) of x; each class's decision line is represented by a row of the weight matrix.
Weight and Bias Effect
Changing the weight changes the angle of the line, while changing the bias moves the line left/right.
Discriminant function

• It is used as a dimensionality reduction technique and is commonly applied as a pre-processing step in machine learning and pattern classification projects.

• It projects a high-dimensional data set onto a lower-dimensional space. The goal is to do this while keeping a decent separation between classes and reducing the resources and costs of computing. https://youtu.be/azXCzI57Yfc
Linear Discriminant Analysis
• three key steps.
1. Calculate the separability between different classes. This is
also known as between-class variance and is defined as the
distance between the mean of different classes.
2. Calculate the within-class variance. This is the distance
between the mean and the sample of every class.
3. Construct the lower-dimensional space that maximizes Step 1 (between-class variance) and minimizes Step 2 (within-class variance). The projection P onto this lower-dimensional space is chosen to maximize that ratio, which is also known as Fisher's criterion.
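A hedged sketch of these three steps using scikit-learn's LinearDiscriminantAnalysis; the Iris dataset is used only as an illustration.

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# Project 4-dimensional data onto 2 discriminant axes (Fisher's criterion)
lda = LinearDiscriminantAnalysis(n_components=2)
X_reduced = lda.fit_transform(X, y)
print(X_reduced.shape)   # (150, 2)
print(lda.score(X, y))   # classification accuracy on the training data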
Naïve Bayes Classifier Algorithm
• It is a probabilistic classifier, which means it predicts on the basis of the
probability of an object.
• Examples: spam filtration, sentiment analysis, and classifying articles.

Naïve: It assumes that the occurrence of a certain feature is independent of the occurrence of other features. For example, if a fruit is identified on the basis of color, shape, and taste, then a red, spherical, and sweet fruit is recognized as an apple. Hence each feature individually contributes to identifying it as an apple without depending on the others.

Bayes: It is used to determine the probability of a hypothesis with prior


knowledge. It depends on the conditional probability.
P(A|B) = P(B|A) · P(A) / P(B)
• P(A|B) is the Posterior probability: probability of hypothesis A given the observed event B.
• P(B|A) is the Likelihood: probability of the evidence given that hypothesis A is true.
• P(A) is the Prior probability: probability of the hypothesis before observing the evidence.
• P(B) is the Marginal probability: probability of the evidence.

• Step for NB Classification:


1. Convert the given dataset into frequency tables.
2. Generate Likelihood table by finding the probabilities of given features.
3. Now, use Bayes theorem to calculate the posterior probability.
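A hedged sketch of the Naive Bayes workflow using scikit-learn's GaussianNB (a Gaussian variant chosen here for illustration; the synthetic data is made up).

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

nb = GaussianNB()
nb.fit(X_train, y_train)              # learns class priors and per-feature likelihoods
print(nb.predict(X_test[:5]))         # predicted classes
print(nb.predict_proba(X_test[:5]))   # posterior probability of each class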
Generative approach: is to learn each language and determine as to
which language the speech belongs to
Discriminative approach: is determine the linguistic differences without
learning any language– a much easier task!
Maximum Margin Classifier - SVM

Support Vector Machine


Support Vectors: These are the points
that are closest to the hyperplane. A
separating line will be defined with the
help of these data points.

Margin: it is the distance between the hyperplane and the observations closest to the hyperplane (the support vectors). In SVM, a large margin is considered a good margin.

Examples: Face detection, image classification, text categorization, etc.
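A hedged SVM sketch using scikit-learn's SVC with a linear kernel; the dataset and the value of C are illustrative assumptions.

from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=2, n_redundant=0, random_state=1)

svm = SVC(kernel="linear", C=1.0)   # C controls how soft the margin is
svm.fit(X, y)
print(svm.support_vectors_.shape)   # the support vectors that define the margin
print(svm.predict(X[:5]))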


Decision Tree Classification Algorithm
• A decision tree is a hierarchical data structure implementing the divide-and-conquer strategy. It is an efficient nonparametric method, which can be used for both classification and regression.
• Decision Trees usually mimic human thinking ability while making a decision, so they are easy to understand.

– internal nodes --> features of a dataset


– branches --> decision rules
– leaf node --> outcome.
A decision tree is a hierarchical model for supervised learning whereby the local region is identified in a sequence of recursive splits in a smaller number of steps. A decision tree is composed of internal decision nodes and terminal leaves (see figure 9.1). Each decision node m implements a test function fm(x) with discrete outcomes labeling the branches. Given an input, at each node a test is applied and one of the branches is taken depending on the outcome. This process starts at the root and is repeated recursively until a leaf node is hit, at which point the value written in the leaf constitutes the output.
Random Forest Algorithm
• Random Forest is a classifier that contains a number of decision trees on
various subsets of the given dataset and takes the average to improve the
predictive accuracy of that dataset.
• The greater number of trees in the forest leads to higher accuracy and
prevents the problem of overfitting.
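A hedged Random Forest sketch with scikit-learn; the synthetic dataset and the number of trees are illustrative choices.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# 100 decision trees, each trained on a bootstrap sample of the dataset
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)
print(forest.predict(X[:5]))
print(forest.feature_importances_)   # importance of each feature, averaged over trees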
UNIT III ENSEMBLE TECHNIQUES AND UNSUPERVISED
LEARNING

• Combining multiple learners: Model combination


schemes, Voting, Ensemble Learning - bagging, boosting,
stacking,

• Unsupervised learning: K-means, Instance Based


Learning: KNN, Gaussian mixture models and
Expectation maximization.
Combining multiple learners

• No one single algorithm is always the most accurate.


• So many models are composed of multiple learners that
complement each other .
• By combining them, we attain higher accuracy.

• the combining of models is done by using two approaches


namely “Ensemble Models” & “Hybrid Models”.
Model combination schemes
• Mixture of experts, is an ensemble learning technique that
implements the idea of training experts on subtasks of a
predictive modeling problem.
• There are four elements to the approach, they are:

1. Division of a task into subtasks.


2. Develop an expert for each subtask.
3. Use a gating model to decide which expert to
use.
4. Pool predictions and gating model output to
make a prediction.
Voting

• In Voting Classifiers, multiple models from different machine learning algorithms are present; the whole dataset is fed to each of them, and every algorithm predicts once trained on the data.

• Once all the models predict the sample data, the


most frequent strategy is used to get the final
prediction from the model.

• Here, the category most predicted by the multiple


algorithms will be treated as the final prediction of
the model.
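A hedged hard-voting sketch with scikit-learn's VotingClassifier; the choice of the three base algorithms is only an example.

from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)

voter = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("nb", GaussianNB()),
                ("dt", DecisionTreeClassifier())],
    voting="hard",   # the most frequently predicted class wins
)
voter.fit(X, y)
print(voter.predict(X[:5]))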
Ensemble Learning
An ensemble of classifiers is a set of classifiers whose individual decisions are combined in some way to classify new examples.

Simplest approach:
1. Generate multiple classifiers
2. Each votes on test instance
3. Take majority as classification

Classifiers different due to different sampling of training data, or randomized parameters within
the classification algorithm

Aim: take simple mediocre algorithm and transform it into a super classifier without requiring
any fancy new algorithm
BAGGING

• In bagging, we use bootstrap sampling to obtain subsets of data for training a set of base models.

• Bootstrap sampling is the process of drawing random samples with replacement from the training set, so some examples appear multiple times and others not at all.
• Each sample is used to train a separate decision
tree, and the results of each model are aggregated.
• For classification tasks, each model votes on an
outcome.
• In regression tasks, the model result is averaged.
Base models with low bias but high variance are
well-suited for bagging.
• Random forests, which are bagged combinations of decision trees, are the canonical example of this approach.
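A hedged bagging sketch: decision trees trained on bootstrap samples and aggregated by voting; the base model and number of estimators are illustrative assumptions.

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)

bagger = BaggingClassifier(
    DecisionTreeClassifier(),   # low-bias, high-variance base model
    n_estimators=50,
    bootstrap=True,             # sample the training set with replacement
    random_state=0,
)
bagger.fit(X, y)
print(bagger.predict(X[:5]))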
BOOSTING

In boosting, we improve performance by


concentrating modeling efforts on the data that
results in more errors (i.e., focus on the hard
stuff).

We train a sequence of models where more


weight is given to examples that were
misclassified by earlier iterations.

Base models with a low variance but high bias


are well-adapted for boosting.

Gradient Boosting is a famous example of this


approach.
STACKING

In stacking, we create an ensemble function


that combines the outputs from multiple base
models into a single score.

The base-level models are trained based on a


complete dataset, and then their outputs are
used as input features to train an ensemble
function.
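A hedged stacking sketch with scikit-learn's StackingClassifier; the base models and the logistic-regression combiner are illustrative choices.

from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)

stacker = StackingClassifier(
    estimators=[("dt", DecisionTreeClassifier()), ("svm", SVC())],
    final_estimator=LogisticRegression(),   # ensemble function trained on base-model outputs
)
stacker.fit(X, y)
print(stacker.predict(X[:5]))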
AdaBoost Algorithm
• It builds a model and gives equal weights to all the data points.
• It then assigns higher weights to points that are wrongly classified.
• Now all the points which have higher weights are given more importance in
the next model.
• It will keep training models until a lower error is achieved.
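A hedged AdaBoost sketch with scikit-learn; the number of estimators and learning rate below are illustrative assumptions.

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=300, random_state=0)

# Each successive weak learner puts more weight on previously misclassified points
boost = AdaBoostClassifier(n_estimators=100, learning_rate=0.5, random_state=0)
boost.fit(X, y)
print(boost.predict(X[:5]))
print(boost.score(X, y))   # training accuracy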
Unsupervised learning: K-means
• It is an iterative algorithm that divides the unlabeled dataset into k different clusters in such a way that each data point belongs to only one group of points with similar properties.

• The k-means clustering algorithm mainly performs two tasks:


– Determines the best value for K center points or centroids by an iterative process.
– Assigns each data point to its closest k-center. Those data points which are near to the particular k-center,
create a cluster.
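A hedged k-means sketch with scikit-learn; the synthetic blobs and the choice of k = 3 are purely illustrative.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(X)    # cluster index assigned to each data point
print(labels[:10])
print(km.cluster_centers_)    # the k learned centroids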
Instance-based learning
• It simply stores the training examples instead of learning an explicit description of the target function.
• Generalizing the examples is postponed until a new instance must be classified.
• When a new instance is encountered, its relationship to the stored examples is examined in
order to assign a target function value for the new instance.
• Instance-based methods are sometimes referred to as lazy learning methods because they
delay processing until a new instance must be classified.

• A key advantage of lazy learning is that instead of


estimating the target function once for the entire instance
space, these methods can estimate it locally and
differently for each new instance to be classified.

• Instance-based learning includes KNN, Gaussian mixture


models and Expectation maximization.
K-NN algorithm
• K-NN algorithm assumes the similarity between the new case/data and
available cases and put the new case into the category that is most similar
to the available categories.
• K-NN algorithm stores all the available data and classifies a new data point based on similarity. This means that when new data appears, it can be easily classified into a well-suited category by using the K-NN algorithm.

For example, if the 3 nearest neighbors of a new data point are from category A, then the new data point must belong to category A.
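A hedged K-NN sketch with scikit-learn using k = 3, matching the example above; the data and query point are made up.

from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=2, n_redundant=0, random_state=0)

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)                       # "training" just stores the examples (lazy learning)
print(knn.predict([[0.5, -0.2]]))   # class decided by the 3 nearest stored neighbours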
Gaussian mixture models
• They are used to classify data into different categories based on
the probability distribution.
• Gaussian mixture models can be used when data is generated by
a mix of Gaussian distributions when there is uncertainty about the
correct number of clusters, and when clusters have different
shapes.
Eg: Finding patterns in medical
datasets, Modeling natural
phenomena, Customer behavior
analysis, Stock price prediction,
Gene expression data analysis
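A hedged Gaussian mixture sketch with scikit-learn (fitted internally by the EM algorithm); the blob data and number of components are illustrative.

from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

gmm = GaussianMixture(n_components=3, random_state=0)
gmm.fit(X)
print(gmm.means_)                 # mean of each Gaussian component
print(gmm.predict(X[:5]))         # hard cluster assignments
print(gmm.predict_proba(X[:5]))   # soft (probabilistic) assignments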
Expectation-Maximization (EM) Algorithm
• The Expectation-Maximization (EM) algorithm is an iterative way to find maximum-likelihood estimates
for model parameters when the data is incomplete or has some missing data points or has some
hidden variables.
• EM chooses some random values for the missing data points and estimates a new set of data. These new values are then used recursively to produce better estimates, by filling up the missing points, until the values converge.

1. Expectation step (E - step): It involves the estimation


(guess) of all missing values in the dataset so that
after completing this step, there should not be any
missing value.

2. Maximization step (M - step): This step involves the


use of estimated data in the E-step and updating the
parameters.

3. Repeat E-step and M-step until the convergence of


the values occurs.
UNIT IV NEURAL NETWORKS

Multilayer perceptron, activation functions, network training


– gradient descent optimization – stochastic gradient
descent, error backpropagation, from shallow networks to
deep networks –Unit saturation (aka the vanishing gradient
problem) – ReLU, hyperparameter tuning, batch
normalization, regularization, dropout.
(1) a set of weighted inputs wi
that correspond to the synapses
(2) an adder that sums the input
signals (equivalent to the
membrane of the cell that
collects electrical charge)
(3) an activation function (initially
a threshold function) that decides
whether the neuron fires
(‘spikes’) for the current inputs
The Perceptron is nothing
more than a collection of
neurons together with a set
of inputs and some weights
to fasten the inputs to the
neurons.
Multi-Layer Perceptron
• A multi-layer perceptron consists of fully connected dense layers, which transform any input dimension to the desired dimension. A multi-layer perceptron is a neural network that has multiple layers. To create a neural network, we combine neurons together so that the outputs of some neurons are inputs of other neurons.

MLP uses
backpropagation for
training the network
ACTIVATION FUNCTIONS

The activation function of a node defines the output of the node. The four most popular activation functions are:

1. Sigmoid function – better than the step function, it also limits the output to between 0 and 1, but it smoothens the value. Its output can be interpreted as a probability, and it is a continuous function. When we have binary problems, we use the sigmoid function.

2. Tanh function – similar to sigmoid, it limits the function from -1 to 1.

3. Step function – It restricts the value of output to 0 and 1.

4. Rectified linear unit – ReLU is like half of step function, it suppresses the
negative values. It is the most popular and utilized function.
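A minimal NumPy sketch of the four activation functions listed above; the input values are arbitrary.

import numpy as np

def step(x):
    return np.where(x >= 0, 1.0, 0.0)   # restricts output to 0 or 1

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))     # smooth, output between 0 and 1

def tanh(x):
    return np.tanh(x)                   # smooth, output between -1 and 1

def relu(x):
    return np.maximum(0.0, x)           # suppresses negative values

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for f in (step, sigmoid, tanh, relu):
    print(f.__name__, f(x))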
NETWORK TRAINING
1. First an ANN will require a random weight initialization
2. Split the dataset in batches (batch size)
3. Send the batches 1 by 1 to the GPU
4. Calculate the forward pass (what would be the output with the
current weights)
5. Compare the calculated output to the expected output (loss)
6. Adjust the weights (using the learning rate increment or
decrement) according to the backward pass (backward gradient
propagation).
7. Go back to step 2
GRADIENT DESCENT OPTIMIZATION
• Gradient Descent is known as one of the most commonly
used optimization algorithms to train machine learning models
by means of minimizing errors between actual and expected
results.
• It helps in finding the local minimum of a function.

• The best way to define the local minimum or local maximum :

– If we move towards a negative gradient or away from the gradient


of the function at the current point, it will give the local minimum of
that function.

– Whenever we move towards a positive gradient or towards the


gradient of the function at the current point, we will get the local
maximum of that function.
STOCHASTIC GRADIENT DESCENT

Stochastic gradient descent is an optimization


algorithm often used in machine learning applications
to find the model parameters that correspond to the
best fit between predicted and actual outputs.

It runs one training example per iteration.

As it requires only one training example at a time,


hence it is easier to store in allocated memory.

It is more efficient for large datasets.


ERROR BACKPROPAGATION
• The main goal is to compute the gradient of the error function.
• For each data point, the error function is computed by passing a labeled data point through the network (feed forward).
• Next, the gradients are calculated starting from the final layer and
then through use of the chain rule, the gradients can be passed
backwards to calculate the gradients in the previous layers.
• The goal is to get the gradients for the loss function with respect
to each model parameter (weights for each neural node
connection as well as the bias weights).
FROM SHALLOW NETWORKS TO DEEP
NETWORKS
Shallow Neural Networks: "shallow" networks have just 3 layers of neurons:
1. Input layer
2. Hidden layer (the math processing happens here)
3. Output layer (our statistical result)
Deep Neural Networks: the developed form of neural networks, with more than one hidden layer.
UNIT SATURATION (AKA THE VANISHING
GRADIENT PROBLEM)
• the vanishing gradient problem is
encountered when training artificial neural
networks with gradient-based learning
methods and backpropagation.

• In such methods, during each iteration of


training each of the neural network’s weights
receives an update proportional to the partial
derivative of the error function with respect to
the current weight.

• [1] The problem is that in some cases, the


gradient will be vanishingly small, effectively
preventing the weight from changing its
value.

• [1] In the worst case, this may completely


stop the neural network from further training.
RELU
• A Rectified Linear Unit is a form of
activation function
• In essence, the function returns 0 if it receives a negative input, and if it receives a positive value, the function returns back the same positive value.

• The function is understood as:


• f(x)=max(0,x)

• The rectified linear unit, or ReLU, allows


for the deep learning model to account
for non-linearities and specific interaction
effects.
HYPERPARAMETER TUNING
The set of parameters that are used to control the behaviour of the model/algorithm and are adjustable in order to obtain a model with optimal performance are called hyperparameters.

Hyperparameters are parameters that cannot be directly learned from the regular training process.

Ex:

• Number of leaves, bins, or depth of a tree


• Number of iterations
• Number of latent factors in a matrix
• Learning rate
• Number of hidden layers in a deep NN
• The number of clusters in k-means clustering
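A hedged hyperparameter-tuning sketch using GridSearchCV; the SVM parameter grid shown is only an example of hyperparameters to search over.

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)

param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(), param_grid, cv=5)   # 5-fold CV for each combination
search.fit(X, y)
print(search.best_params_)
print(search.best_score_)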
BATCH NORMALIZATION
• Batch normalization is a
technique for training very deep
neural networks that
standardizes the inputs to a
layer for each mini-batch.

• This has the effect of stabilizing


the learning process and
dramatically reducing the
number of training epochs
required to train deep networks.
REGULARIZATION
Regularization is a technique used to reduce the errors by
fitting the function appropriately on the given training set and
avoid overfitting.
DROPOUT
Dropout refers to units (nodes) that are intentionally dropped from a neural network during training to improve processing and time to results.

It refers to dropping out the nodes (input and hidden layer) in a neural network

Dropout is a regularization method that approximates training a large number of neural networks
with different architectures in parallel.
UNIT V DESIGN AND ANALYSIS OF MACHINE LEARNING EXPERIMENTS

Guidelines for Machine Learning Experiments, Cross Validation (CV) and Resampling – K-Fold CV, Bootstrapping, Measuring Classifier Performance, Assessing a Single Classification Algorithm and Comparing Two Classification Algorithms – t Test, McNemar's Test, K-Fold CV Paired t Test.
GUIDELINES FOR MACHINE LEARNING EXPERIMENTS
• Aim of the Study : what are the objectives (e.g assessing the expected error of an
algorithm, comparing two learning algorithm on a particular problem, etc. )

• Selection of the Response Variable : what should we use as the quality measure (e.g
error, precision and recall, complexity, etc. )

• Choice of Factors and Levels : what are the factors for the defined aim of the study
( factors are hyperparameters when the algorithm is fix and want to find best
hyperparameters, If we are comparing algorithms, the learning algorithm is a factor )

• Choice of Experimental Design : use factorial design unless we are sure that the factors
do not interact
• The replication number depends on the dataset size; it can be kept small when the dataset is large.
• Avoid using small datasets (if possible), which lead to responses with high variance; the differences will then not be significant and the results will not be conclusive.
• Performing the Experiment : doing a few trial runs for some random settings
to check that all is expected, before doing the factorial experiment.

• Statistical Analysis of the Data : conclusion we get should not be due to


chance.

• Conclusions and Recommendations:
• One frequent conclusion is the need for further experimentation.
• There is always a risk that our conclusions may be wrong, especially if the data is small and noisy.
• When our expectations are not met, it is most helpful to investigate why they are not.
SETTING UP YOUR DATA
• Once you have cleaned your dataset, the next job is to split the data
into two segments for testing and training.

• It is very important not to test your model with the same data that you
used for training.

• The ratio of the two splits

• training data - 70 percent to 80 percent.


• test data - 20 percent to 30 percent.
CROSS VALIDATION (CV)
Cross-validation is a technique in which we train our model using the subset of the data-set and
then evaluate using the complementary subset of the data-set.

three steps involved in cross-validation are as follows :

1. Reserve some portion of sample data-set.


2. Using the rest data-set train the model.
3. Test the model using the reserve portion of the data-set.
K-FOLD CV

The data sample is split into 'k' number of smaller samples >>> K-fold Cross Validation.

The general procedure is as follows:

1. Shuffle the dataset randomly.


2. Split the dataset into k groups
3. For each unique group:
1. Take the group as a hold out or test data
set
2. Take the remaining groups as a training
data set
3. Fit a model on the training set and
evaluate it on the test set
4. Retain the evaluation score and discard
the model
4. Summarize the skill of the model using the
sample of model evaluation scores
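A hedged sketch of the k-fold procedure above with k = 5; the logistic-regression model and synthetic data are illustrative choices.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=200, random_state=0)
kf = KFold(n_splits=5, shuffle=True, random_state=0)   # steps 1 and 2: shuffle and split

scores = []
for train_idx, test_idx in kf.split(X):                # step 3: one fold held out each time
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))
print(np.mean(scores), np.std(scores))                 # step 4: summarize the k scores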
BOOTSTRAPPING
Bootstrap Sampling is a method that involves drawing of sample data
repeatedly with replacement from a data source to estimate a population
parameter.

Sample N instances from a dataset of size N


with replacement. The original dataset is used
as the validation set.

The probability that we pick an instance is 1/N; the probability that we do not pick it is 1 − 1/N.

After N draws, the probability that an instance is never picked is (1 − 1/N)^N ≈ e^−1 ≈ 0.368, so roughly 63.2% of the original instances appear in each bootstrap sample.
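A minimal NumPy sketch of bootstrap sampling (drawing N instances with replacement); the toy dataset is illustrative.

import numpy as np

rng = np.random.default_rng(0)
data = np.arange(10)   # a toy dataset of N = 10 instances

sample = rng.choice(data, size=len(data), replace=True)
print(sample)                               # some instances repeat, some never appear
print(np.unique(sample).size / len(data))   # fraction picked, about 63% on average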


MEASURING CLASSIFIER PERFORMANCE

• Accuracy
• Confusion Matrix
• Precision
• Recall
• F-Score
• AUC(Area Under
the Curve)-ROC
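A hedged sketch of these measures using sklearn.metrics; the labels and scores below are made up for illustration.

from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]                     # toy ground-truth labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]                     # toy hard predictions
y_score = [0.9, 0.2, 0.8, 0.4, 0.1, 0.7, 0.6, 0.3]    # toy predicted probabilities

print(accuracy_score(y_true, y_pred))
print(confusion_matrix(y_true, y_pred))
print(precision_score(y_true, y_pred))
print(recall_score(y_true, y_pred))
print(f1_score(y_true, y_pred))
print(roc_auc_score(y_true, y_score))   # AUC-ROC uses the scores, not the hard labels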
ASSESSING A SINGLE CLASSIFICATION ALGORITHM

One-class algorithms are based on recognition since


their aim is to recognize data from a particular class, and
reject data from all other classes.

• This is accomplished by creating a boundary that


encompasses all the data belonging to the target class

• when a new sample arrives the algorithm only has to check


whether it lies within the boundary or outside

• Accordingly, the algorithm classifies the sample as belonging to the target class or as an outlier.
COMPARING TWO CLASSIFICATION ALGORITHMS

• Classification algorithm is a Supervised Learning


technique that is used to identify the category of
new observations on the basis of training data.
• In Classification, a program learns from the given
dataset or observations and then classifies new
observation into a number of classes or groups.
Such as, Yes or No, 0 or 1, Spam or Not Spam, cat
or dog, etc.

• criterion for comparison of the algorithms-

1. Training time
2. Inference time
3. Inference accuracy (F1 score)
T TEST
• A t-test is a type of inferential statistic used
to determine if there is a significant
difference between the means of two
groups, which may be related in certain
features.
• If the t-value is large => the two groups are likely to be different.
• If the t-value is small => the two groups are likely to be similar.
• three types of t-tests
1. Independent samples t-test: compares the
means for two groups.
2. Paired sample t-test: compares means from
the same group at different times (say, one
year apart).
3. One-sample t-test: compares the mean of a single group against a known mean.
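A hedged paired t-test sketch with SciPy, comparing per-fold accuracies of two classifiers; the accuracy values are made up for illustration.

from scipy.stats import ttest_rel

acc_a = [0.81, 0.79, 0.84, 0.80, 0.82]   # accuracy of algorithm A on each fold
acc_b = [0.77, 0.75, 0.80, 0.78, 0.79]   # accuracy of algorithm B on the same folds

t_stat, p_value = ttest_rel(acc_a, acc_b)
print(t_stat, p_value)   # a small p-value suggests a significant difference in means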
MCNEMAR’S TEST
The McNemar Test is a statistical test used to determine if the proportions of categories in
two related groups significantly differ from each other. To use this test, you should have two
group variables with two or more options.

It can be used to compare the predictive accuracy of two models.
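A hedged McNemar's test sketch using statsmodels; the 2x2 contingency table (rows: classifier A correct/wrong, columns: classifier B correct/wrong) is made up for illustration.

from statsmodels.stats.contingency_tables import mcnemar

table = [[55, 5],
         [15, 25]]
result = mcnemar(table, exact=True)   # exact binomial test on the discordant cells
print(result.statistic, result.pvalue)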


K-FOLD CV PAIRED T TEST
