Unit 3 & 4 (p18)

Cluster analysis is a technique used to group objects based on their similarities. It is commonly used in exploratory data analysis and machine learning applications. Dimensionality reduction refers to techniques that reduce the number of input variables in a dataset in order to improve machine learning performance. The Expectation-Maximization (EM) algorithm is an iterative method for finding maximum likelihood estimates of parameters in statistical models, where the model depends on unobserved latent variables. It alternates between performing an expectation (E) step, which computes the expected value of the log-likelihood, and a maximization (M) step, which computes the parameter values that maximize the expected log-likelihood found on the E step.


UNIT 3

SHORT QUESTIONS TWO MARKS:

q.1 What is cluster analysis?


Ans Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group
(called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task
of exploratory data analysis, and a common technique for statistical data analysis, used in many fields, including pattern
recognition, image analysis, information retrieval, bioinformatics, data compression, computer graphics and machine
learning.

q.2 What is dimensionality reduction?


Ans. Dimensionality reduction refers to techniques that reduce the number of input variables in a dataset. Large
numbers of input features can cause poor performance for machine learning algorithms, so dimensionality reduction is the
general field of study concerned with reducing the number of input features.

q.3 What is the task of the E-step of the EM-algorithm?


Ans. Expectation step (E – step): Using the observed available data of the dataset, estimate (guess) the values of the
missing data.

q.4 What objective function do regression trees minimize?


Ans. A regression tree chooses its splits so as to minimize the sum of squared errors (equivalently, the mean squared error, or
the variance) between the observed target values and the mean prediction within each resulting leaf. The fitted tree can then
be evaluated on new data to check whether the model is a good fit.

q.5 What is linear discriminant analysis?


Ans. Linear Discriminant Analysis or Normal Discriminant Analysis or Discriminant Function Analysis is a dimensionality
reduction technique that is commonly used for supervised classification problems. It is used for modelling differences in
groups i.e. separating two or more classes. It is used to project the features in higher dimension space into a lower
dimension space.
For example, we have two classes and we need to separate them efficiently. Classes can have multiple features. Using
only a single feature to classify them may result in some overlapping as shown in the below figure. So, we will keep on
increasing the number of features for proper classification.


q.6 What is multidimensional scaling?


Ans. Multidimensional scaling (MDS) is a means of visualizing the level of similarity of individual cases of a dataset. MDS is
used to translate "information about the pairwise 'distances' among a set of objects or individuals" into a configuration of
points mapped into an abstract Cartesian space.
q.7 What is the task of the M-step of the EM-algorithm?
Ans. Complete data generated after the expectation (E) step is used in order to update the parameters.

q.8 What are the usage, flow chart, advantages and disadvantages of the EM algorithm?


Flow chart for EM algorithm – initialize the parameters, then alternate between the E-step (estimate the missing/latent values from the observed data) and the M-step (update the parameters using the completed data), checking for convergence after each iteration and stopping once the values converge.

Usage of EM algorithm –
It can be used to fill the missing data in a sample.
It can be used as the basis of unsupervised learning of clusters.
It can be used for the purpose of estimating the parameters of Hidden Markov Model (HMM).
It can be used for discovering the values of latent variables.

Advantages of EM algorithm –
It is always guaranteed that likelihood will increase with each iteration.
The E-step and M-step are often pretty easy for many problems in terms of implementation.
Solutions to the M-steps often exist in the closed form.
Disadvantages of EM algorithm –
It has slow convergence.
It makes convergence to the local optima only.
It requires both the probabilities, forward and backward (numerical optimization requires only forward probability).
Long questions:

q.1 What is clustering in unsupervised learning? What is a partitioning algorithm?


Ans. Clustering is the task of dividing the population or data points into a number of groups such that data points in
the same group are more similar to each other than to data points in other groups. It is basically a collection of objects
grouped on the basis of similarity and dissimilarity between them.
For example, the data points in the graph below that are clustered together can be classified into one single group. We can
distinguish the clusters, and we can identify that there are 3 clusters in the below picture.

It is not necessary for clusters to be spherical. Such as :

Partitioning Methods: These methods partition the objects into k clusters, and each partition forms one cluster. Each
partition is built to optimize an objective criterion (a similarity function, typically based on distance). Examples include
K-means and CLARANS (Clustering Large Applications based upon Randomized Search).

q.2 Define K means algorithm with the help of numerical?


Ans. K-Means Clustering is an Unsupervised Learning algorithm, which groups the unlabeled dataset into different
clusters. Here K defines the number of pre-defined clusters that need to be created in the process: if K=2, there will be
two clusters; for K=3, there will be three clusters; and so on.
It is an iterative algorithm that divides the unlabeled dataset into k different clusters in such a way that each data point
belongs to only one group of points with similar properties.
It allows us to cluster the data into different groups and is a convenient way to discover the categories of groups in the
unlabeled dataset on its own, without the need for any training.
It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim of this algorithm is to
minimize the sum of distances between the data point and their corresponding clusters.
The algorithm takes the unlabeled dataset as input, divides the dataset into k-number of clusters, and repeats the process
until it does not find the best clusters. The value of k should be predetermined in this algorithm.
The k-means clustering algorithm mainly performs two tasks:
Determines the best value for K center points or centroids by an iterative process.
Assigns each data point to its closest k-center. Those data points which are near to the particular k-center, create a
cluster.
Hence each cluster has datapoints with some commonalities, and it is away from other clusters.
The below diagram explains the working of the K-means Clustering Algorithm:

How does the K-Means Algorithm Work?


The working of the K-Means algorithm is explained in the below steps:
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points as centroids. (They need not be points from the input dataset.)
Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters.
Step-4: Calculate the variance and place a new centroid for each cluster.
Step-5: Repeat the third step, i.e., reassign each data point to the new closest centroid of its cluster.
Step-6: If any reassignment occurs, then go to step-4; else go to FINISH.
Step-7: The model is ready.
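As an illustration only (not part of the original notes), the steps above can be sketched in a few lines of Python/NumPy; the function name, the toy data and the convergence check below are assumptions:

import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step-2: pick k random points from the data as initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step-3: assign each point to its closest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step-4/5: recompute each centroid as the mean of its assigned points
        # (this sketch assumes no cluster ever becomes empty)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step-6: stop when no centroid moves any more
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Toy usage with two variables M1 and M2 as the two columns
X = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0], [5.0, 7.0], [3.5, 5.0], [4.5, 5.0]])
centroids, labels = kmeans(X, k=2)
print(centroids, labels)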

Suppose we have two variables M1 and M2. The x-y axis scatter plot of these two variables is given below:

Let's take number k of clusters, i.e., K=2, to identify the dataset and to put them into different clusters. It means here we
will try to group these datasets into two different clusters. We need to choose some random k points or centroid to form
the cluster. These points can be either the points from the dataset or any other point. So, here we are selecting the below
two points as k points, which are not the part of our dataset. Consider the below image:
Now we will assign each data point of the scatter plot to its closest K-point or centroid. We will compute this by
calculating the distance (for example, the Euclidean distance) between each point and the two centroids. To visualize the
assignment, we draw the median line between the two centroids. Consider the below image:

From the above image, it is clear that points left side of the line is near to the K1 or blue centroid, and points to the right
of the line are close to the yellow centroid. Let's color them as blue and yellow for clear visualization.

As we need to find the closest cluster, we will repeat the process by choosing new centroids. To choose the new
centroids, we will compute the center of gravity of the points assigned to each cluster, and will find new centroids as below:

Next, we will reassign each datapoint to the new centroid. For this, we will repeat the same process of finding a median
line. The median will be like below image:
From the above image, we can see, one yellow point is on the left side of the line, and two blue points are right to the line.
So, these three points will be assigned to new centroids.

As reassignment has taken place, we will again go to step-4, which is finding new centroids or K-points.
We will repeat the process by finding the center of gravity of the points in each cluster, so the new centroids will be as shown in the below
image:

As we got the new centroids so again will draw the median line and reassign the data points. So, the image will be:

We can see in the above image that no data points change sides of the line, which means our model is
formed. Consider the below image:
As our model is ready, so we can now remove the assumed centroids, and the two final clusters will be as shown in the
below image:

How to choose the value of "K number of clusters" in K-means Clustering?


The performance of the K-means clustering algorithm depends upon the quality of the clusters that it forms, but choosing
the optimal number of clusters is a big task. There are several ways to find the optimal number of clusters; the most
commonly used method to find the number of clusters, or value of K, is worked through in the numerical below:
Numerical: https://www.youtube.com/watch?v=K2sBRVCXZqs

q.3 What is expectation maximization algorithm?


Ans. In real-world applications of machine learning, it is very common that there are many relevant features
available for learning but only a small subset of them is observable. For a variable that is sometimes observable
and sometimes not, we can use the instances in which it is observed for the purpose of learning, and then predict its
value in the instances in which it is not observable.
The Expectation-Maximization algorithm can also be used for latent variables (variables that are not directly observable
and are actually inferred from the values of the other observed variables), in order to predict their values, provided that
the general form of the probability distribution governing those latent variables is known to us. This algorithm is the basis
of many unsupervised clustering algorithms in the field of machine learning.
It was explained, proposed and given its name in a paper published in 1977 by Arthur Dempster, Nan Laird, and Donald
Rubin. It is used to find the local maximum likelihood parameters of a statistical model in cases where latent
variables are involved and the data is missing or incomplete.

Algorithm:

Given a set of incomplete data, consider a set of starting parameters.


Expectation step (E – step): Using the observed available data of the dataset, estimate (guess) the values of the missing
data.
Maximization step (M – step): Complete data generated after the expectation (E) step is used in order to update the
parameters.
Repeat step 2 and step 3 until convergence.
The essence of Expectation-Maximization algorithm is to use the available observed data of the dataset to estimate the
missing data and then using that data to update the values of the parameters. Let us understand the EM algorithm in
detail.
Initially, a set of initial values of the parameters are considered. A set of incomplete observed data is given to the system
with the assumption that the observed data comes from a specific model.
The next step is known as “Expectation” – step or E-step. In this step, we use the observed data in order to estimate or
guess the values of the missing or incomplete data. It is basically used to update the variables.
The next step is known as “Maximization”-step or M-step. In this step, we use the complete data generated in the
preceding “Expectation” – step in order to update the values of the parameters. It is basically used to update the
hypothesis.
Now, in the fourth step, it is checked whether the values are converging or not, if yes, then stop otherwise repeat step-
2 and step-3 i.e. “Expectation” – step and “Maximization” – step until the convergence occurs.
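As a hedged illustration (not from the original notes), the E-step/M-step loop can be sketched for a two-component one-dimensional Gaussian mixture; the initial guesses, the fixed iteration count and the toy data are assumptions, and SciPy is assumed to be available:

import numpy as np
from scipy.stats import norm

def em_gmm(x, n_iters=50):
    # Step 1: starting guesses for the parameters
    mu = np.array([x.min(), x.max()])
    sigma = np.array([x.std(), x.std()])
    pi = np.array([0.5, 0.5])
    for _ in range(n_iters):  # in practice, repeat until convergence
        # E-step: estimate the "missing" data, i.e. the responsibility of
        # each component for each observation
        dens = pi * norm.pdf(x[:, None], mu, sigma)        # shape (n, 2)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: use the completed data to update the parameters
        nk = resp.sum(axis=0)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
        pi = nk / len(x)
    return mu, sigma, pi

x = np.concatenate([np.random.normal(0, 1, 200), np.random.normal(5, 1, 200)])
print(em_gmm(x))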
Flow chart for EM algorithm – initialize the parameters, then alternate between the E-step (estimate the missing/latent values from the observed data) and the M-step (update the parameters using the completed data), checking for convergence after each iteration and stopping once the values converge.

Usage of EM algorithm –
It can be used to fill the missing data in a sample.
It can be used as the basis of unsupervised learning of clusters.
It can be used for the purpose of estimating the parameters of Hidden Markov Model (HMM).
It can be used for discovering the values of latent variables.
Advantages of EM algorithm –
It is always guaranteed that likelihood will increase with each iteration.
The E-step and M-step are often pretty easy for many problems in terms of implementation.
Solutions to the M-steps often exist in the closed form.
Disadvantages of EM algorithm –
It has slow convergence.
It makes convergence to the local optima only.
It requires both the probabilities, forward and backward (numerical optimization requires only forward probability).

q.5 What is PCA?


Ans. Principal Component Analysis is an unsupervised learning algorithm that is used for the dimensionality reduction
in machine learning. It is a statistical process that converts the observations of correlated features into a set of linearly
uncorrelated features with the help of orthogonal transformation. These new transformed features are called
the Principal Components. It is one of the popular tools that is used for exploratory data analysis and predictive modeling.
It is a technique to draw out the strong patterns in the given dataset by reducing the dimensionality while retaining most of the variance.
PCA generally tries to find the lower-dimensional surface onto which to project the high-dimensional data.
PCA works by considering the variance of each attribute, because attributes with high variance carry more of the information
in the data, and hence it reduces the dimensionality while losing as little information as possible. Some real-world applications
of PCA are image processing, movie recommendation systems, and optimizing the power allocation in various communication
channels. It is a feature extraction technique, so it keeps the important variables and drops the least important variables.
The PCA algorithm is based on some mathematical concepts such as:
Variance and Covariance
Eigenvalues and Eigenvectors
Some common terms used in PCA algorithm:
Dimensionality: It is the number of features or variables present in the given dataset. More easily, it is the number of
columns present in the dataset.
Correlation: It signifies that how strongly two variables are related to each other. Such as if one changes, the other
variable also gets changed. The correlation value ranges from -1 to +1. Here, -1 occurs if variables are inversely
proportional to each other, and +1 indicates that variables are directly proportional to each other.
Orthogonal: It defines that variables are not correlated to each other, and hence the correlation between the pair of
variables is zero.
Eigenvectors: Given a square matrix M and a non-zero vector v, v is an eigenvector of M if Mv is a scalar
multiple of v.
Covariance Matrix: A matrix containing the covariance between the pair of variables is called the Covariance Matrix.

Principal Components in PCA


As described above, the transformed new features, or the output of PCA, are the Principal Components. The number of
these PCs is either equal to or less than the number of original features present in the dataset. Some properties of these principal
components are given below:
The principal component must be the linear combination of the original features.
These components are orthogonal, i.e., the correlation between a pair of variables is zero.
The importance of each component decreases when going from 1 to n: the first PC has the most importance, and the nth PC
has the least importance.

Steps for PCA algorithm


Getting the dataset
Firstly, we need to take the input dataset and divide it into two subparts X and Y, where X is the training set, and Y is the
validation set.
Representing data into a structure
Now we will represent our dataset in a structure, such as a two-dimensional matrix of the independent variables X.
Here each row corresponds to a data item, and each column corresponds to a feature. The number of
columns is the dimensionality of the dataset.
Standardizing the data
In this step, we will standardize our dataset. Otherwise, in a raw dataset, the features with high variance appear more
important than the features with lower variance.
If the importance of a feature should be independent of its variance, we divide each data item in a column
by the standard deviation of that column. We will name the resulting matrix Z.
Calculating the Covariance of Z
To calculate the covariance of Z, we will take the matrix Z, and will transpose it. After transpose, we will multiply it by Z.
The output matrix will be the Covariance matrix of Z.
Calculating the Eigen Values and Eigen Vectors
Now we need to calculate the eigenvalues and eigenvectors of the resulting covariance matrix. The eigenvectors of the
covariance matrix are the directions of the axes with the most information (highest variance), and the corresponding
eigenvalues give the amount of variance along those directions.
Sorting the Eigen Vectors
In this step, we will take all the eigenvalues and will sort them in decreasing order, which means from largest to smallest.
And simultaneously sort the eigenvectors accordingly into a matrix P. The resultant sorted matrix will be named P*.
Calculating the new features Or Principal Components

Here we will calculate the new features. To do this, we will multiply Z by the P* matrix. In the resultant matrix Z*, each
observation is a linear combination of the original features. The columns of the Z* matrix are independent of each other.
Remove less important features from the new dataset.
The new feature set has been obtained, so we will decide here what to keep and what to remove. That is, we will keep only
the relevant or important features in the new dataset, and the unimportant features will be removed.
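As an illustrative sketch (not part of the original notes), the steps above map directly onto a few lines of NumPy; the function name and the random toy data are assumptions:

import numpy as np

def pca(X, n_components):
    # Standardize the data (matrix Z)
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # Covariance of Z: Z transposed multiplied by Z, scaled by the sample size
    cov = (Z.T @ Z) / (len(Z) - 1)
    # Eigenvalues and eigenvectors of the covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Sort the eigenvectors by decreasing eigenvalue (matrix P*)
    order = np.argsort(eigvals)[::-1]
    P_star = eigvecs[:, order[:n_components]]
    # New features / principal components: project Z onto P*
    return Z @ P_star, eigvals[order]

X = np.random.rand(100, 5)               # 100 observations, 5 features
Z_star, eigvals = pca(X, n_components=2)
print(Z_star.shape)                      # (100, 2)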
Applications of Principal Component Analysis
PCA is mainly used as the dimensionality reduction technique in various AI applications such as computer vision, image
compression, etc.

It can also be used for finding hidden patterns if data has high dimensions. Some fields where PCA is used are Finance,
data mining, Psychology, etc.

Q.6 What is independent component analysis?


Ans. Prerequisite: Principal Component Analysis
Independent Component Analysis (ICA) is a machine learning technique to separate independent sources from a mixed
signal. Unlike principal component analysis which focuses on maximizing the variance of the data points, the
independent component analysis focuses on independence, i.e. independent components.

Problem: To extract independent sources’ signals from a mixed signal composed of the signals from those sources.
Given: Mixed signal from five different independent sources.
Aim: To decompose the mixed signal into independent sources:
(Figure: the five separated source signals, Source 1 through Source 5.)
Solution: Independent Component Analysis (ICA).
Consider Cocktail Party Problem or Blind Source Separation problem to understand the problem which is solved by
independent component analysis.

Here, there is a party going on in a room full of people. There are 'n' speakers in that room, and they are
speaking simultaneously at the party. In the same room, there are also 'n' microphones placed at different
distances from the speakers, which are recording the 'n' speakers' voice signals. Hence, the number of speakers is equal to
the number of microphones in the room.
Now, using these microphones' recordings, we want to separate all the 'n' speakers' voice signals in the room, given that
each microphone recorded the voice signals coming from each speaker at a different intensity due to the differences in
distance between them. Decomposing the mixed signal of each microphone's recording into the independent sources'
speech signals can be done using the machine learning technique of independent component analysis.
[ X1, X2, ….., Xn ] => [ Y1, Y2, ….., Yn ]
where, X1, X2, …, Xn are the original signals present in the mixed signal and Y1, Y2, …, Yn are the new features and are
independent components which are independent of each other.
Restrictions on ICA –
The independent components generated by the ICA are assumed to be statistically independent of each other.
The independent components generated by the ICA must have non-gaussian distribution.
The number of independent components generated by the ICA is equal to the number of observed mixed signals.
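As a hedged illustration of blind source separation (not from the original notes), scikit-learn's FastICA can be applied to a synthetic two-source mixture; the toy signals and mixing matrix below are assumptions:

import numpy as np
from sklearn.decomposition import FastICA

t = np.linspace(0, 8, 2000)
s1 = np.sin(2 * t)                        # source 1
s2 = np.sign(np.sin(3 * t))               # source 2
S = np.c_[s1, s2]
A = np.array([[1.0, 0.5], [0.5, 2.0]])    # mixing matrix
X = S @ A.T                               # observed mixed signals (the "microphones")

ica = FastICA(n_components=2, random_state=0)
Y = ica.fit_transform(X)                  # estimated independent sources
print(Y.shape)                            # (2000, 2)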
q.7 Explain multidimensional scaling in brief?
Ans. Multidimensional scaling (MDS) is a means of visualizing the level of similarity of individual cases of a dataset. MDS is
used to translate "information about the pairwise 'distances' among a set of objects or individuals" into a configuration
of points mapped into an abstract Cartesian space.[1]
More technically, MDS refers to a set of related ordination techniques used in information visualization, in particular to
display the information contained in a distance matrix. It is a form of non-linear dimensionality reduction.
Given a distance matrix with the distances between each pair of objects in a set, and a chosen number of dimensions, N,
an MDS algorithm places each object into N-dimensional space (a lower-dimensional representation) such that the
between-object distances are preserved as well as possible. For N = 1, 2, and 3, the resulting points can be visualized on
a scatter plot.
Types
Classical multidimensional scaling
It is also known as Principal Coordinates Analysis (PCoA), Torgerson Scaling or Torgerson–Gower scaling. It takes an input
matrix giving dissimilarities between pairs of items and outputs a coordinate matrix whose configuration minimizes a loss
function called strain.
Metric multidimensional scaling (mMDS)
It is a superset of classical MDS that generalizes the optimization procedure to a variety of loss functions and input
matrices of known distances with weights and so on. A useful loss function in this context is called stress, which is often
minimized using a procedure called stress majorization.
Non-metric multidimensional scaling (nMDS)
In contrast to metric MDS, non-metric MDS finds both a non-parametric monotonic relationship between the dissimilarities
in the item-item matrix and the Euclidean distances between items, and the location of each item in the low-dimensional
space.
Generalized multidimensional scaling (GMDS)
An extension of metric multidimensional scaling, in which the target space is an arbitrary smooth non-Euclidean space. In
cases where the dissimilarities are distances on a surface and the target space is another surface, GMDS allows finding the
minimum-distortion embedding of one surface into another.
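As an illustrative sketch (not part of the original notes), metric MDS can be run on a precomputed distance matrix with scikit-learn; the toy distance matrix below is an assumption:

import numpy as np
from sklearn.manifold import MDS

# Pairwise distances between four objects
D = np.array([[0.0, 1.0, 2.0, 3.0],
              [1.0, 0.0, 1.5, 2.5],
              [2.0, 1.5, 0.0, 1.0],
              [3.0, 2.5, 1.0, 0.0]])

mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(D)   # one 2-D point per object, distances preserved as well as possible
print(coords)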

q.8 Explain linear discriminant analysis in machine learning?


Ans. Linear Discriminant Analysis or Normal Discriminant Analysis or Discriminant Function Analysis is a dimensionality
reduction technique that is commonly used for supervised classification problems. It is used for modelling differences in
groups i.e. separating two or more classes. It is used to project the features in higher dimension space into a lower
dimension space.
For example, we have two classes and we need to separate them efficiently. Classes can have multiple features. Using
only a single feature to classify them may result in some overlapping as shown in the below figure. So, we will keep on
increasing the number of features for proper classification.


Example:
Suppose we have two sets of data points belonging to two different classes that we want to classify. As shown in the
given 2D graph, when the data points are plotted on the 2D plane, there’s no straight line that can separate the two
classes of the data points completely. Hence, in this case, LDA (Linear Discriminant Analysis) is used which reduces the
2D graph into a 1D graph in order to maximize the separability between the two classes.
Here, Linear Discriminant Analysis uses both the axes (X and Y) to create a new axis and projects data onto a new axis in
a way to maximize the separation of the two categories and hence, reducing the 2D graph into a 1D graph.

Two criteria are used by LDA to create a new axis:


Maximize the distance between means of the two classes.
Minimize the variation within each class.

In the above graph, it can be seen that a new axis (in red) is generated and plotted in the 2D graph such that it
maximizes the distance between the means of the two classes and minimizes the variation within each class. In simple
terms, this newly generated axis increases the separation between the data points of the two classes. After generating
this new axis using the above-mentioned criteria, all the data points of the classes are plotted on this new axis and are
shown in the figure given below.

But Linear Discriminant Analysis fails when the means of the distributions are shared, as it becomes impossible for LDA
to find a new axis that makes both classes linearly separable. In such cases, we use non-linear discriminant analysis.
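As a hedged illustration (not from the original notes), scikit-learn's LinearDiscriminantAnalysis can both project two-class data onto the single new axis and classify new points; the synthetic data below is an assumption:

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])  # two 2-D classes
y = np.array([0] * 50 + [1] * 50)

lda = LinearDiscriminantAnalysis(n_components=1)
X_1d = lda.fit_transform(X, y)    # 2-D data projected onto the single new axis
print(X_1d.shape)                 # (100, 1)
print(lda.predict([[1.5, 1.5]]))  # class prediction for a new point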
Extensions to LDA:
Quadratic Discriminant Analysis (QDA): Each class uses its own estimate of variance (or covariance when there are
multiple input variables).
Flexible Discriminant Analysis (FDA): Where non-linear combinations of inputs are used such as splines.
Regularized Discriminant Analysis (RDA): Introduces regularization into the estimate of the variance (actually
covariance), moderating the influence of different variables on LDA.
q.9 Difference between PCA and ICA?
Ans. Difference between PCA and ICA –
PCA: It reduces the dimensions to avoid the problem of overfitting. | ICA: It decomposes the mixed signal into its independent sources' signals.
PCA: It deals with the Principal Components. | ICA: It deals with the Independent Components.
PCA: It focuses on maximizing the variance. | ICA: It doesn't focus on the issue of variance among the data points.
PCA: It focuses on the mutual orthogonality property of the principal components. | ICA: It doesn't focus on the mutual orthogonality of the components.
PCA: It doesn't focus on the mutual independence of the components. | ICA: It focuses on the mutual independence of the components.

UNIT 4
Short questions:

q.1 Write the basics of decision trees.


Ans. A decision tree is a graphical representation of all possible solutions to a decision based on certain conditions. On
each step or node of a decision tree, used for classification, we try to form a condition on the features to separate all the
labels or classes contained in the dataset to the fullest purity.

q.2 Write the issues in decision tree learning.


Ans. Overfitting the data
Guarding against bad attribute choices
Handling continuous valued attributes
Handling missing attribute values
Handling attributes with differing costs
q.3 Write the formula of Information gain and entropy?
Ans. The formula for calculating information entropy is:

Info(D) = − Σ pi log2(pi), where pi is the proportion of examples in D belonging to class i.

Information Gain is calculated for a split by subtracting the weighted entropies of each branch from the original entropy.
When training a Decision Tree using these metrics, the best split is chosen by maximizing Information Gain:
Gain(A) = Info(D) − InfoA(D), where InfoA(D) = Σj (|Dj| / |D|) Info(Dj) is the weighted entropy after splitting D on attribute A.
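As an illustrative sketch (not part of the original notes), the two formulas can be computed directly; the 9-positive/5-negative class counts below are toy values chosen for illustration:

import math

def info(counts):
    # Info(D) = -sum over classes of p_i * log2(p_i)
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def gain(parent_counts, branch_counts):
    # Gain(A) = Info(D) - InfoA(D), where InfoA(D) is the weighted
    # entropy of the branches produced by splitting on attribute A
    total = sum(parent_counts)
    info_a = sum(sum(b) / total * info(b) for b in branch_counts)
    return info(parent_counts) - info_a

# Parent node: 9 positive / 5 negative; a candidate attribute splits it into 3 branches
print(gain([9, 5], [[2, 3], [4, 0], [3, 2]]))   # about 0.246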

q.4 What is perceptron?


Ans. The Perceptron is a linear machine learning algorithm for binary classification tasks. It may be considered one of the
first and one of the simplest types of artificial neural networks. The Perceptron classifier is a linear algorithm that can
be applied to binary classification tasks.

q.5 What is neuron?


Ans. A neuron or nerve cell is an electrically excitable cell that communicates with other cells via specialized connections
called synapses. It is the main component of nervous tissue in all animals except sponges and placozoa. A typical neuron
consists of a cell body (soma), dendrites, and a single axon.

q.6 What is ANN?


Ans. The term "artificial neural network" refers to a biologically inspired sub-field of artificial intelligence modeled after
the brain. An artificial neural network is usually a computational network based on biological neural networks,
mimicking the structure of the human brain.

q.7 What is CNN?


Ans. A convolutional neural network (CNN) is a type of artificial neural network used in image recognition and processing
that is specifically designed to process pixel data.

q.8 Explain the layers of CNN?


Ans. Input layer, Convo (convolution) layer, pooling layer, fully connected (FC) layer, softmax/logistic layer, and output layer.

q.9 What is multilayer feed forward neural network?


Ans. A multilayer feedforward neural network is an interconnection of perceptrons in which data and calculations flow in a
single direction, from the input data to the outputs. The number of layers in a neural network is the number of layers of
perceptrons.
q.10 What is convergence analysis?
Ans. A machine learning model reaches convergence when it achieves a state during training in which loss settles to within
an error range around the final value. In other words, a model converges when additional training will not improve the
model.

Long question:

q.1 Explain the ID3 algorithm with the help of a numerical example.

Ans. https://www.youtube.com/watch?v=coOTEc-0OGw
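The linked video works through the numerical example. As a hedged, approximate illustration only (not the video's example), scikit-learn's DecisionTreeClassifier with the "entropy" criterion selects splits by information gain in the spirit of ID3 (note that scikit-learn builds binary splits rather than ID3's multi-way splits); the toy encoding below is an assumption:

from sklearn.tree import DecisionTreeClassifier, export_text

# Features: outlook (0=sunny, 1=overcast, 2=rain), windy (0/1); label: play (0/1)
X = [[0, 0], [0, 1], [1, 0], [2, 0], [2, 1], [1, 1], [0, 0], [2, 0]]
y = [0, 0, 1, 1, 0, 1, 0, 1]

tree = DecisionTreeClassifier(criterion="entropy")   # entropy => information-gain style splits
tree.fit(X, y)
print(export_text(tree, feature_names=["outlook", "windy"]))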

q.2 Compare ANN and Bayesian networks.


Ans. Bayesian Network
In this network, the graph represents the conditional dependencies of the various variables in the model. Each node
represents a variable, and a directed edge represents a conditional relationship. The graphical model can also be seen as a
visual image of the chain rule.
Neural Network
In a neural network, each node is a simulated "neuron". The neuron is essentially on or off, and its activation is
determined by a linear combination of the outputs of the nodes in the preceding "layer" of the network.

q.3 Discuss overfitting and underfitting situation in decision tree learning


Ans. Underfitting:
A statistical model or a machine learning algorithm is said to have underfitting when it cannot capture the underlying
trend of the data. (It’s just like trying to fit undersized pants!) Underfitting destroys the accuracy of our machine learning
model. Its occurrence simply means that our model or the algorithm does not fit the data well enough. It usually
happens when we have fewer data to build an accurate model and also when we try to build a linear model with fewer
non-linear data. In such cases, the rules of the machine learning model are too easy and flexible to be applied on such
minimal data and therefore the model will probably make a lot of wrong predictions. Underfitting can be avoided by
using more data and also reducing the features by feature selection.
In a nutshell, Underfitting – High bias and low variance

Techniques to reduce underfitting:


Increase model complexity
Increase the number of features, performing feature engineering
Remove noise from the data.
Increase the number of epochs or increase the duration of training to get better results.
Overfitting:
A statistical model is said to be overfitted when we train it with a lot of data (just like fitting ourselves in oversized
pants!). When a model gets trained with so much data, it starts learning from the noise and inaccurate data entries in
our data set. Then the model does not categorize the data correctly, because of too many details and noise. The causes
of overfitting are the non-parametric and non-linear methods because these types of machine learning algorithms have
more freedom in building the model based on the dataset and therefore they can really build unrealistic models. A
solution to avoid overfitting is using a linear algorithm if we have linear data or using the parameters like the maximal
depth if we are using decision trees.
In a nutshell, Overfitting – High variance and low bias

q.4 What is the difference between machine learning and deep learning?
Machine Learning is a subset of artificial intelligence focusing on a specific goal: setting computers up to be able to
perform tasks without the need for explicit programming.
Computers are fed structured data (in most cases) and ‘learn’ to become better at evaluating and acting on that data over
time.
Think of ‘structured data’ as data inputs you can put in columns and rows. You might create a category column in Excel
called ‘food’, and have row entries such as ‘fruit’ or ‘meat’. This form of ‘structured’ data is very easy for computers to
work with, and the benefits are obvious (It’s no coincidence that one of the most important data programming languages
is called ‘structured query language’).
Once programmed, a computer can take in new data indefinitely, sorting and acting on it without the need for further
human intervention.
Over time, the computer may be able to recognize that ‘fruit’ is a type of food even if you stop labeling your data. This
‘self-reliance’ is so fundamental to machine learning that the field breaks down into subsets based on how much ongoing
human help is involved.

What is deep learning?


Machine learning is about computers being able to perform tasks without being explicitly programmed… but the
computers still think and act like machines. Their ability to perform some complex tasks — gathering data from an image
or video, for example — still falls far short of what humans are capable of.
Deep learning models introduce an extremely sophisticated approach to machine learning and are set to tackle these
challenges because they've been specifically modeled after the human brain. Complex, multi-layered “deep neural
networks” are built to allow data to be passed between nodes (like neurons) in highly connected ways. The result is a non-
linear transformation of the data that is increasingly abstract.
While it takes tremendous volumes of data to ‘feed and build’ such a system, it can begin to generate immediate results,
and there is relatively little need for human intervention once the programs are in place.

q.5 Describe BPN algorithm in ANN along with a suitable example.


Ans. Backpropagation “How does backpropagation work?” Backpropagation learns by iteratively processing a data set of
training tuples, comparing the network’s prediction for each tuple with the actual known target value. The target value
may be the known class label of the training tuple (for classification problems) or a continuous value (for numeric
prediction). For each training tuple, the weights are modified so as to minimize the mean-squared error between the
network’s prediction and the actual target value. These modifications are made in the “backwards” direction (i.e., from
the output layer) through each hidden layer down to the first hidden layer (hence the name backpropagation).
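As a hedged illustration (not from the book or the original notes), one forward pass plus one backward weight update for a single-hidden-layer network minimizing squared error can be sketched in NumPy; the XOR-style toy data, layer sizes, learning rate and epoch count are assumptions:

import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # training tuples
y = np.array([[0], [1], [1], [0]], dtype=float)               # known target values

W1, b1 = rng.normal(0, 1, (2, 4)), np.zeros(4)   # input -> hidden weights and biases
W2, b2 = rng.normal(0, 1, (4, 1)), np.zeros(1)   # hidden -> output weights and biases
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
lr = 0.5

for epoch in range(5000):
    # Forward pass: the network's prediction for each training tuple
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Backward pass: propagate the error from the output layer back
    # through the hidden layer and modify the weights to reduce it
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= lr * (h.T @ d_out)
    b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * (X.T @ d_h)
    b1 -= lr * d_h.sum(axis=0)

print(out.round(3))   # predictions should move toward [0, 1, 1, 0]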

See the Han and Kamber book:


http://myweb.sabanciuniv.edu/rdehkharghani/files/2016/02/The-Morgan-Kaufmann-Series-in-Data-Management-Systems-
Jiawei-Han-Micheline-Kamber-Jian-Pei-Data-Mining.-Concepts-and-Techniques-3rd-Edition-Morgan-Kaufmann-2011.pdf
A numerical example is given on page 403 of the book.
q.6 Illustrate the operation of the ID3 training example. Consider information gain as
attribute measure.

Ans. https://www.youtube.com/watch?v=coOTEc-0OGw

q.7 Explain the layers of CNN?


Ans. There are the following layers in a CNN: input layer, Convo layer (Convo + ReLU), pooling layer, fully connected (FC)
layer, softmax/logistic layer, and output layer.

Different layers of CNN

4.1 Input Layer


Input layer in CNN should contain image data. Image data is represented by a three-dimensional matrix, as we saw earlier. You
need to reshape it into a single column: suppose you have an image of dimension 28 x 28 (= 784 pixels); you need to convert it
into a 784 x 1 vector before feeding it into the input. If you have "m" training examples, then the dimension of the input will be (784, m).

4.2. Convo Layer


Convo layer is sometimes called the feature extractor layer because the features of the image are extracted within this layer.
First of all, a part of the image is connected to the Convo layer to perform the convolution operation, as we saw earlier, calculating
the dot product between the receptive field (a local region of the input image that has the same size as the filter) and
the filter. The result of the operation is a single element of the output volume. Then we slide the filter over the next receptive field
of the same input image by a stride and do the same operation again. We repeat the same process again and again
until we have gone through the whole image. The output will be the input for the next layer.
The Convo layer also contains a ReLU activation to set all negative values to zero.
4.3. Pooling Layer

Source : CS231n Convolutional Neural Network


Pooling layer is used to reduce the spatial volume of the input image after convolution. It is used between two convolution
layers. If we apply FC after the Convo layer without applying pooling or max pooling, it will be computationally expensive,
which we don't want. So max pooling is a common way to reduce the spatial volume of the input image. In the above example,
we have applied max pooling on a single depth slice with a stride of 2. You can observe that the 4 x 4 input is reduced to a 2
x 2 dimension.
There is no parameter in pooling layer but it has two hyperparameters — Filter(F) and Stride(S).
In general, if we have input dimension W1 x H1 x D1, then
W2 = (W1−F)/S+1
H2 = (H1−F)/S+1
D2 = D1
Where W2, H2 and D2 are the width, height and depth of output.
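As a quick illustrative check of these formulas (not part of the original notes), the 4 x 4 input with a 2 x 2 filter and stride 2 from the example above gives a 2 x 2 output:

def pool_output_size(W1, H1, D1, F, S):
    W2 = (W1 - F) // S + 1
    H2 = (H1 - F) // S + 1
    D2 = D1            # depth is unchanged by pooling
    return W2, H2, D2

print(pool_output_size(W1=4, H1=4, D1=1, F=2, S=2))   # (2, 2, 1): 4 x 4 reduced to 2 x 2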

4.4. Fully Connected Layer(FC)


Fully connected layer involves weights, biases, and neurons. It connects neurons in one layer to neurons in another layer.
It is used to classify images into different categories by training.

4.5. Softmax / Logistic Layer


Softmax or Logistic layer is the last layer of the CNN. It resides at the end of the FC layer. Logistic is used for binary classification
and softmax is used for multi-class classification.

4.6. Output Layer


The output layer contains the label in one-hot encoded form.
Now you have a good understanding of CNN. Let's sketch a minimal CNN in Keras.
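The sketch below is illustrative only; the 28 x 28 x 1 input shape, the filter counts and the 10 output classes are assumptions, not a prescribed architecture:

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(28, 28, 1)),               # input layer
    layers.Conv2D(32, (3, 3), activation="relu"),  # Convo + ReLU
    layers.MaxPooling2D(pool_size=(2, 2)),         # pooling layer
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D(pool_size=(2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),          # fully connected layer
    layers.Dense(10, activation="softmax"),        # softmax / output layer
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()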

https://youtu.be/coOTEc-0OGw
https://youtu.be/NsmAEmoSRjk
https://youtu.be/jBb8I9BpJrU
https://youtu.be/3EQw8awLQJ4
https://youtu.be/XzSlEA4ck2I
https://youtu.be/CJjSPCslxqQ
https://youtu.be/P2KZisgs4A4
https://youtu.be/K2sBRVCXZqs
https://youtu.be/v9tWTiMd0iE
https://youtu.be/i_bx7LI_h_4
https://youtu.be/7fnarUsRMO0
https://youtu.be/87oLR76aK2g
https://youtu.be/Y1dxfValzY0
https://youtu.be/wzm1NqZSpys
https://youtu.be/hODHKaSv1n0
