DL Unit-1
COURSE MATERIAL
UNIT 1
COURSE B.TECH
SEMESTER 4-1
Version V-1
10 ASSIGNMENTS 34
11 PART A QUESTIONS & ANSWERS (2 MARKS QUESTIONS) 35
12 PART B QUESTIONS 35
13 SUPPORTIVE ONLINE CERTIFICATION COURSES 35
14 REAL TIME APPLICATIONS 35
15 CONTENTS BEYOND THE SYLLABUS 37
16 PRESCRIBED TEXT BOOKS & REFERENCE BOOKS 37
17 MINI PROJECT SUGGESTION 37
1. Course Objectives
The objectives of this course are to:
1. Demonstrate the major technology trends driving Deep Learning.
2. Build, train and apply fully connected neural networks.
3. Implement efficient neural networks.
4. Analyze the key parameters and hyperparameters in a neural network's architecture.
5. Apply concepts of Deep Learning to solve real-world problems.
2. Prerequisites
This course is intended for senior undergraduate and junior graduate students
who have a proper understanding of
Python Programming Language
Calculus
Linear Algebra
Probability Theory
Although it would be helpful, knowledge about classical machine learning is NOT
required.
3. Syllabus
UNIT I
Introduction to Linear Algebra: Scalars, Vectors, Matrices and Tensors, Matrix
Operations, Types of Matrices, Norms, Eigen Decomposition, Singular Value
Decomposition, Principal Component Analysis.
Probability and Information Theory: Random Variables, Probability Distribution,
Marginal Distribution, Conditional Probability, Expectations, Variance and
Covariance, Bayes' Rule, Information Theory.
Numerical Computation: Overflow and Underflow, Gradient-Based Optimization,
Constraint-Based Optimization, Linear Least Squares.
4. Course outcomes
1. Demonstrate the mathematical foundations of neural networks.
2. Describe the basics of machine learning.
3. Differentiate architectures of deep neural networks.
4. Build convolutional neural networks.
5. Build and train RNNs and LSTMs.
5. CO-PO / PSO Mapping
     PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12 PSO1 PSO2
CO1  3 2
CO2  3 2
CO3  3 3 2 2 3 2 2
CO4  3 3 2 2 3 2 2
CO5
6. Lesson Plan
2. You will work on case studies from healthcare, autonomous driving, sign
language reading, music generation, and natural language processing. You
will master not only the theory, but also see how it is applied in industry.
8. Lecture Notes
• Scalars: A scalar is just a single number, in contrast to most of the other objects
studied in linear algebra, which are usually arrays of multiple numbers. We write
scalars in italics. We usually give scalars lower-case variable names. When we
introduce them, we specify what kind of number they are.
For example, we might say “Let s ∈ R be the slope of the line,” while defining a
real-valued scalar, or “Let n ∈ N be the number of units,” while defining a natural
number scalar.
• Vectors: A vector is an ordered array of numbers. We typically write a vector as a column:
x = [x1, x2, ..., xn]^T
We can think of vectors as identifying points in space, with each element giving
the coordinate along a different axis.
We use the − sign to index the complement of a set. For example, x−1 is the
vector containing all elements of x except for x1, and if S = {1, 3, 6}, then x−S is
the vector containing all of the elements of x except for x1, x3 and x6.
• Matrices: A matrix is a 2-D array of numbers, so each element is identified by two indices, e.g. Ai,j.
Figure: The transpose of a matrix can be thought of as a mirror image across the main diagonal; for example, the transpose of
[ A1,1 A1,2
  A2,1 A2,2
  A3,1 A3,2 ]
is
[ A1,1 A2,1 A3,1
  A1,2 A2,2 A3,2 ].
Sometimes we may need to index matrix-valued expressions that are not just a
single letter. In this case, we use subscripts after the expression, but do not
convert anything to lower case. For example, f (A)i,j gives element (i, j) of the
matrix computed by applying the function f to A.
Tensors: In some cases we will need an array with more than two axes. In the
general case, an array of numbers arranged on a regular grid with a variable
number of axes is known as a tensor. We denote a tensor named “A” with this
typeface: A. We identify the element of A at coordinates (i, j, k) by writing Ai,j,k.
Operations on Matrices
Addition, subtraction and multiplication are the basic operations on matrices.
To add or subtract matrices, they must be of identical order; for
multiplication, the number of columns in the first matrix must equal the number of
rows in the second matrix.
Addition of Matrices
Subtraction of Matrices
Scalar Multiplication of Matrices
Multiplication of Matrices
Addition of Matrices
If A[aij]mxn and B[bij]mxn are two matrices of the same order, then their sum A + B is
a matrix, and each element of that matrix is the sum of the corresponding
elements. i.e. A + B = [aij + bij]mxn
Consider the two matrices A & B of order 2 x 2. Then the sum is given by:
[a1 b1]   [a2 b2]   [a1+a2 b1+b2]
[c1 d1] + [c2 d2] = [c1+c2 d1+d2]
Properties of Matrix Addition: If A, B and C are matrices of the same order, then
(a) Commutative Law: A + B = B + A
(b) Associative Law: (A + B) + C = A + (B + C)
(c) Identity of the Matrix: A + O = O + A = A, where O is the zero matrix, which is
the additive identity for matrix addition.
(d) Additive Inverse: A + (−A) = O = (−A) + A, where (−A) is obtained by changing
the sign of every element of A and is the additive inverse of A.
(e) Cancellation Law: A + B = A + C ⇒ B = C, and B + A = C + A ⇒ B = C.
(f) tr(A ± B) = tr(A) ± tr(B)
(g) If A + B = O = B + A, then B is called the additive inverse of A and A is called
the additive inverse of B.
Subtraction of Matrices
If A and B are two matrices of the same order, then we define A−B=A+(−B).
Consider the two matrices A & B of order 2 x 2. Then the difference is given by:
[a1 b1]   [a2 b2]   [a1−a2 b1−b2]
[c1 d1] − [c2 d2] = [c1−c2 d1−d2]
We can subtract the matrices by subtracting each element of one matrix from
the corresponding element of the second matrix. i.e. A – B = [aij – bij]mxn
Scalar Multiplication of Matrices
If A = [aij]m×n is a matrix and k is any number, then the matrix obtained by
multiplying every element of A by k is called the scalar multiple of A by k,
and it is denoted by kA. Thus if A = [aij]m×n, then kAm×n = Am×nk = [k aij]m×n.
Properties of Scalar Multiplication: If A, B are matrices of the same order and λ
and μ are any two scalars then;
(a) λ(A + B) = λA + λB
(b) (λ + μ)A = λA + μA
(c) λ(μA) = (λμ)A = μ(λA)
(d) (−λ)A = −(λA) = λ(−A)
(e) tr(kA) = k tr(A)
Multiplication of Matrices
If A and B be any two matrices, then their product AB will be defined only when
the number of columns in A is equal to the number of rows in B.
If A = [aij]m×n and B = [bij]n×p, then their product AB = C = [cij]m×p is a matrix
of order m×p, where (AB)ij = cij = Σ (r = 1 to n) a_ir b_rj.
Properties of matrix multiplication
(a) Matrix multiplication is not commutative in general, i.e. in general AB≠BA.
(b) Matrix multiplication is associative, i.e. (AB)C = A(BC).
(c) Matrix multiplication is distributive over matrix addition, i.e. A.(B + C) = A.B +
A.C and (A + B)C = AC + BC.
(d) If A is an m × n matrix, then ImA=A=AIn.
(e) The product of two matrices can be a null matrix while neither of them is null,
i.e. if AB = 0, it is not necessary that either A = 0 or B = 0.
(f) If A is an m × n matrix and O is a null matrix then Am×n.On×p=Om×p. i.e. the
product of the matrix with a null matrix is always a null matrix.
(g) If AB = 0, it does not follow that A = 0 or B = 0; the product of two non-zero
matrices may be a zero matrix.
(h) AB = AC does not imply B = C (the cancellation law is not applicable).
(i) tr(AB) = tr(BA)
(j) There exists a multiplicative identity I for every square matrix A such that AI = IA = A.
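As a quick illustration of these operations and properties, here is a minimal NumPy sketch (the matrices A and B and the scalar k are arbitrary example values):

import numpy as np

# Arbitrary example matrices and scalar, chosen only for illustration
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
k = 3

print(A + B)        # element-wise addition [aij + bij]
print(A - B)        # element-wise subtraction [aij - bij]
print(k * A)        # scalar multiplication [k aij]
print(A @ B)        # matrix product, (AB)ij = sum over r of a_ir * b_rj

# Multiplication is not commutative in general, but tr(AB) = tr(BA)
print(np.array_equal(A @ B, B @ A))        # False for this A and B
print(np.trace(A @ B), np.trace(B @ A))    # equal values

# The identity matrix is the multiplicative identity: AI = IA = A
I = np.eye(2, dtype=int)
print(np.array_equal(A @ I, A) and np.array_equal(I @ A, A))   # True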
This section is divided into six parts covering the main types of matrices; they are:
1. Square Matrix
2. Symmetric Matrix
3. Triangular Matrix
4. Diagonal Matrix
5. Identity Matrix
6. Orthogonal Matrix
Square Matrix
A square matrix is a matrix where the number of rows (n) equals the number of columns
(m).
n=m
The square matrix is contrasted with the rectangular matrix where the number of
rows and columns are not equal.
Given that the number of rows and columns match, the dimensions are usually
denoted as n, e.g. n x n. The size of the matrix is called the order, so an order 4
square matrix is 4 x 4.
The vector of values along the diagonal of the matrix from the top left to the
bottom right is called the main diagonal.
Below is an example of an order 3 square matrix.
      1, 2, 3
M = ( 1, 2, 3 )
      1, 2, 3
Square matrices are readily added and multiplied together and are the basis of
many simple linear transformations, such as rotations (as in the rotations of
images).
Symmetric Matrix
A symmetric matrix is a type of square matrix where the top-right triangle is the
same as the bottom-left triangle.
To be symmetric, the axis of symmetry is always the main diagonal of the matrix,
from the top left to the bottom right.
Below is an example of a 5×5 symmetric matrix.
1, 2, 3, 4, 5
2, 1, 2, 3, 4
M = (3, 2, 1, 2, 3)
4, 3, 2, 1, 2
5, 4, 3, 2, 1
A symmetric matrix is always square and equal to its own transpose. M = M^T
Triangular Matrix
A triangular matrix is a type of square matrix that has all values in the upper-right or
lower-left of the matrix with the remaining elements filled with zero values.
A triangular matrix with values only above the main diagonal is called an upper
triangular matrix. Whereas, a triangular matrix with values only below the main
diagonal is called a lower triangular matrix.
Diagonal Matrix
A diagonal matrix is one where values outside of the main diagonal have a zero
value, where the main diagonal is taken from the top left of the matrix to the
bottom right.
A diagonal matrix is often denoted with the variable D and may be represented as
a full matrix or as a vector of values on the main diagonal.
Below is an example of a 3×3 square diagonal matrix.
1, 0, 0
D = (0, 2, 0)
0, 0, 3
As a vector, it would be represented as: d = (1, 2, 3)
Identity Matrix
An identity matrix is a square matrix that does not change a vector when
multiplied.
The values of an identity matrix are known. All of the scalar values along the
main diagonal (top-left to bottom-right) have the value one, while all other
values are zero.
For example, an identity matrix with the size 3 or I3 would be as follows:
1, 0, 0
I = (0, 1, 0)
0, 0, 1
Orthogonal Matrix
Two vectors are orthogonal when their dot product equals zero: v . w = 0.
Orthogonal vectors of unit length are called orthonormal. An orthogonal matrix is a
square matrix whose rows and columns are mutually orthonormal, so that
Q^T Q = Q Q^T = I and therefore Q^T = Q^(−1).
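The following short NumPy sketch illustrates these types of matrices (the example values, and the choice of a 2-D rotation as the orthogonal matrix, are arbitrary):

import numpy as np

# Symmetric: a square matrix equal to its own transpose (arbitrary example values)
M = np.array([[1, 2, 3],
              [2, 1, 2],
              [3, 2, 1]])
print(np.array_equal(M, M.T))     # True, so M is symmetric

# Upper and lower triangular parts of M
print(np.triu(M))                 # upper triangular matrix
print(np.tril(M))                 # lower triangular matrix

# Diagonal matrix built from a vector, and the identity matrix
D = np.diag([1, 2, 3])
print(D)
I = np.eye(3)
v = np.array([4.0, 5.0, 6.0])
print(np.array_equal(I @ v, v))   # True: the identity does not change a vector

# Orthogonal matrix: Q^T Q = I (here a rotation by 30 degrees)
theta = np.radians(30)
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
print(np.allclose(Q.T @ Q, np.eye(2)))   # True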
1.5 NORMS
To understand what norms of vectors are, let us recall that a vector is an
ordered, finite list of numbers, for example x = [x1, x2].
The vector x in this example has two elements, therefore we can easily plot the
vector in a 2D plane: the first element of the vector corresponds to the x-value
and the second element corresponds to the y-value. In such a plot, a vector is
further characterized by its norm, which is the distance of the vector from the
origin at (x, y) = (0, 0), and by its angle. The (Euclidean) norm of this example
vector is calculated as
||x||2 = sqrt(x1^2 + x2^2).
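A minimal NumPy sketch of these ideas (the example vector [3, 4] is arbitrary; the L1 and max norms at the end are other commonly used norms):

import numpy as np

x = np.array([3.0, 4.0])            # arbitrary example vector with two elements

l2 = np.sqrt(np.sum(x ** 2))        # Euclidean (L2) norm: distance from the origin
print(l2)                           # 5.0
print(np.linalg.norm(x))            # same value computed by NumPy

angle = np.degrees(np.arctan2(x[1], x[0]))   # angle of the vector in the plane
print(angle)                                  # about 53.13 degrees

print(np.linalg.norm(x, 1))         # L1 norm: sum of absolute values (7.0)
print(np.linalg.norm(x, np.inf))    # max norm: largest absolute element (4.0)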
1.6. EIGENDECOMPOSITION
An eigenvector of a square matrix A is a non-zero vector v such that multiplication
by A alters only the scale of v:
Av = λv.
The scalar λ is known as the eigenvalue corresponding to this eigenvector. (One can
also define a left eigenvector such that v^T A = λv^T, but we are usually concerned
with right eigenvectors.) If v is an eigenvector of A, then so is any rescaled vector
sv for s ∈ R, s ≠ 0. Moreover, sv still has the same eigenvalue. For this reason, we
usually only look for unit eigenvectors.
Suppose that a matrix A has n linearly independent eigenvectors, {v(1), . . . , v(n)},
with corresponding eigenvalues {λ1, . . . , λn}. We may concatenate all of the
eigenvectors into a matrix V with one eigenvector per column, and all of the
eigenvalues into a vector λ. The eigendecomposition of A is then given by
A = V diag(λ) V^(−1).
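A small NumPy sketch of the eigendecomposition (the matrix A is an arbitrary example):

import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])              # arbitrary example matrix

eigvals, V = np.linalg.eig(A)           # columns of V are unit eigenvectors
print(eigvals)                          # eigenvalues of A

# Check the defining property A v = lambda v for the first eigenvector
v, lam = V[:, 0], eigvals[0]
print(np.allclose(A @ v, lam * v))      # True

# Reconstruct A from its eigendecomposition: A = V diag(lambda) V^(-1)
print(np.allclose(V @ np.diag(eigvals) @ np.linalg.inv(V), A))   # True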
SINGULAR VALUE DECOMPOSITION
The singular value decomposition (SVD) factorizes a matrix into singular vectors and
singular values, A = UDV^T, where U and V are orthogonal matrices and D is a diagonal
matrix whose entries are the singular values of A. We can interpret the singular value
decomposition of A in terms of the eigendecomposition of functions of A. The
left-singular vectors of A (the columns of U) are the eigenvectors of AA^T. The
right-singular vectors of A (the columns of V) are the eigenvectors of A^T A.
The non-zero singular values of A are the square roots of the eigenvalues of A^T A.
The same is true for AA^T.
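A short NumPy sketch that verifies the relationship between the SVD and the eigendecomposition of A^T A (the matrix A is an arbitrary example):

import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])                    # arbitrary 3x2 example matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(np.allclose(U * s @ Vt, A))             # A = U diag(s) V^T

# The singular values are the square roots of the eigenvalues of A^T A
eigvals = np.linalg.eigvalsh(A.T @ A)         # eigenvalues of A^T A, ascending
print(np.allclose(np.sort(s ** 2), eigvals))  # True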
The eigendecomposition of a matrix tells us many useful facts about the matrix.
The matrix is singular if and only if any of the eigenvalues are zero.
PRINCIPAL COMPONENT ANALYSIS
Suppose we have a collection of points in Rn that we would like to store using less
memory by encoding each point x as a lower-dimensional code vector c ∈ Rl. An encoder
function f produces the code, c = f(x), and a decoder function g maps the code back
into Rn. Let g(c) = Dc, where D ∈ Rn×l is the matrix defining the decoding.
Computing the optimal code for this decoder could be a difficult problem. To
keep the encoding problem easy, PCA constrains the columns of D to be
orthogonal to each other and to have unit norm.
In order to turn this basic idea into an algorithm we can implement, the first thing
we need to do is figure out how to generate the optimal code point c∗ for each
input point x. One way to do this is to minimize the distance between the input
point x and its reconstruction, g(c ∗). We can measure this distance using a
norm. In the principal components algorithm, we use the L2 norm:
c* = arg min_c ||x − g(c)||_2.

We can switch to the squared L2 norm instead, because both are minimized by the same
value of c. The squared L2 norm expands to

(x − g(c))^T (x − g(c)) = x^T x − 2x^T g(c) + g(c)^T g(c).

We can now change the function being minimized again, to omit the first term,
since this term does not depend on c:

c* = arg min_c −2x^T g(c) + g(c)^T g(c).

To make further progress, we must substitute in the definition of g(c):

c* = arg min_c −2x^T Dc + c^T D^T Dc
   = arg min_c −2x^T Dc + c^T I_l c      (the columns of D are orthonormal, so D^T D = I_l)
   = arg min_c −2x^T Dc + c^T c.

We can solve this optimization problem by setting the gradient with respect to c to zero:

∇_c (−2x^T Dc + c^T c) = 0
−2D^T x + 2c = 0
c = D^T x.
This makes the algorithm efficient: we can optimally encode x using just a matrix-
vector operation. To encode a vector, we apply the encoder function
f(x) = D^T x.
Using a further matrix multiplication, we can also define the PCA reconstruction
operation:
r(x) = g(f(x)) = DD^T x.
Next, we need to choose the encoding matrix D. To do so, we revisit the idea of
minimizing the L2 distance between inputs and reconstructions.
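The following is a minimal PCA sketch along the lines of the derivation above, assuming D is built from the leading eigenvectors of the covariance matrix of centred data (the random data and the choice l = 2 are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))          # 100 arbitrary points in R^3
X = X - X.mean(axis=0)                 # centre the data

# Columns of D: the l leading eigenvectors of the covariance matrix (orthonormal)
l = 2
cov = X.T @ X / len(X)
eigvals, eigvecs = np.linalg.eigh(cov)          # eigenvalues in ascending order
D = eigvecs[:, np.argsort(eigvals)[::-1][:l]]   # top-l eigenvectors

codes = X @ D              # encoder  f(x) = D^T x, applied to every row of X
recon = codes @ D.T        # decoder  r(x) = D D^T x

print(codes.shape)                                  # (100, 2)
print(np.mean(np.sum((X - recon) ** 2, axis=1)))    # average reconstruction error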
While probability theory allows us to make uncertain statements and reason in the
presence of uncertainty, information theory allows us to quantify the amount of
uncertainty in a probability distribution.
Why Probability?
3. Incomplete modeling. When we use a model that must discard some of the
information we have observed, the discarded information results in uncertainty in
the model’s predictions. For example, suppose we build a robot that can exactly
observe the location of every object around it. If the robot discretizes space when
predicting the future location of these objects, then the discretization makes the
robot immediately become uncertain about the precise position of objects: each
object could be anywhere within the discrete cell that it was observed to occupy.
Random variables may be discrete or continuous. A discrete random variable is one
that has a finite or countably infinite number of states. Note that these states are not
necessarily the integers; they can also just be named states that are not considered
to have any numerical value. A continuous random variable is associated with a real
value.
1.11. PROBABILITY DISTRIBUTIONS
A probability distribution is a description of how likely a random variable or set of
random variables is to take on each of its possible states. The way we describe
probability distributions depends on whether the variables are discrete or continuous.
The probability mass function maps from a state of a random variable to the
probability of that random variable taking on that state. The probability that x = x is
denoted as P (x), with a probability of 1 indicating that x = x is certain and a
probability of 0 indicating that x = x is impossible. Sometimes to disambiguate which
PMF to use, we write the name of the random variable explicitly: P (x = x). Sometimes
we define a variable first, then use ∼ notation to specify which distribution it follows
later: x ∼ P (x).
Probability mass functions can act on many variables at the same time. Such a
probability distribution over many variables is known as a joint probability distribution.
P (x = x, y = y ) denotes the probability that x = x and y = y simultaneously. We may
also write P (x, y) for brevity. To be a probability mass function on a random variable
x, a function P must satisfy the following properties:
• The domain of P must be the set of all possible states of x.
• ∀x ∈ x, 0 ≤ P(x) ≤ 1. An impossible event has probability 0, and an event that is
guaranteed to happen has probability 1.
• Σ_x P(x) = 1. We refer to this property as being normalized.
For example, consider a single discrete random variable x with k different states.
We can place a uniform distribution on x, that is, make each of its
states equally likely, by setting its probability mass function to
P (x = xi) = 1/k
for all i.
Sometimes we know the probability distribution over a set of variables and we want
to know the probability distribution over just a subset of them; this is known as the
marginal probability distribution. For example, suppose we have discrete random
variables x and y, and we know P(x, y). We can find P(x) with the sum rule:
∀ x ∈ x, P (x = x) = Σ_y P (x = x, y = y).
The name “marginal probability” comes from the process of computing marginal
probabilities on paper. When the values of P (x, y) are written in a grid with different
values of x in rows and different values of y in columns, it is natural to sum across a
row of the grid, then write P(x) in the margin of the paper just to the right of the row.
In many cases, we are interested in the probability of some event, given that some
other event has happened. This is called a conditional probability. We denote the
conditional probability that y = y given x = x as P(y = y | x = x). This conditional
probability can be computed with the formula
P(y = y | x = x) = P(y = y, x = x) / P(x = x).
The conditional probability is only defined when P(x = x) > 0. We cannot compute
the conditional probability conditioned on an event that never happens.
The Chain Rule of Conditional Probabilities
Any joint probability distribution over many random variables may be decomposed
into conditional distributions over only one variable:
P(x(1), . . . , x(n)) = P(x(1)) Π (i = 2 to n) P(x(i) | x(1), . . . , x(i−1)).
This observation is known as the chain rule or product rule of probability. It follows
immediately from the definition of conditional probability in equation
Two random variables x and y are independent if their probability distribution can
be expressed as a product of two factors, one involving only x and one involving
only y:
∀ x ∈ x, y ∈ y, p(x = x, y = y) = p(x = x) p(y = y).
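A small NumPy sketch of these ideas, using a hypothetical joint PMF over two discrete variables (the probability values are arbitrary):

import numpy as np

# Hypothetical joint PMF P(x, y): rows index the states of x, columns the states of y
P_xy = np.array([[0.10, 0.20],
                 [0.30, 0.40]])
print(P_xy.sum())                        # 1.0, so this is a valid joint distribution

# Marginal probability (sum rule): P(x) = sum over y of P(x, y)
P_x = P_xy.sum(axis=1)
print(P_x)                               # [0.3 0.7]

# Conditional probability: P(y | x) = P(x, y) / P(x), defined only when P(x) > 0
P_y_given_x = P_xy / P_x[:, None]
print(P_y_given_x)

# Chain rule check: P(x, y) = P(y | x) P(x)
print(np.allclose(P_y_given_x * P_x[:, None], P_xy))   # True

# Independence would require P(x, y) = P(x) P(y) for every pair of states
P_y = P_xy.sum(axis=0)
print(np.allclose(P_xy, np.outer(P_x, P_y)))            # False for this example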
EXPECTATIONS
The expectation, or expected value, of some function f(x) with respect to a
probability distribution P(x) is the average value that f takes on when x is drawn
from P. For discrete variables it is computed with a summation,
Ex∼P[f(x)] = Σ_x P(x) f(x),
while for continuous variables it is computed with an integral:
Ex∼p[f(x)] = ∫ p(x) f(x) dx.
When the identity of the distribution is clear from the context, we may simply write
the name of the random variable that the expectation is over, as in Ex[f(x)]. If it is
clear which random variable the expectation is over, we may omit the subscript
entirely, as in E[f (x)]. By default, we can assume that E[·] averages over the values
of all the random variables inside the brackets. Likewise, when there is no
ambiguity, we may omit the square brackets.
VARIANCE
The variance gives a measure of how much the values of a function of a random
variable x vary as we sample different values of x from its probability distribution:
Var(f(x)) = E[(f(x) − E[f(x)])^2].
When the variance is low, the values of f(x) cluster near their expected value. The
square root of the variance is known as the standard deviation.
COVARIANCE
The covariance gives some sense of how much two values are linearly related to
each other, as well as the scale of these variables:
Cov(f(x), g(y)) = E[(f(x) − E[f(x)]) (g(y) − E[g(y)])].
High absolute values of the covariance mean that the values change very much
and are both far from their respective means at the same time. If the sign of the
covariance is positive, then both variables tend to take on relatively high values
simultaneously. If the sign of the covariance is negative, then one variable tends to
take on a relatively high value at the times that the other takes on a relatively low
value and vice versa. Other measures such as correlation normalize the
contribution of each variable in order to measure only how much the variables are
related, rather than also being affected by the scale of the separate variables.
The notions of covariance and dependence are related, but are in fact distinct
concepts. They are related because two variables that are independent have zero
covariance, and two variables that have non-zero covariance are dependent.
However, independence is a distinct property from zero covariance. For two variables
to have zero covariance, there must be no linear dependence between them.
Independence is a stronger requirement than zero covariance, because
independence also excludes nonlinear relationships. It is possible for two variables
to be dependent but have zero covariance. For example, suppose we first sample
a real number x from a uniform distribution over the interval [−1, 1]. We next sample
a random variable s: with probability 1/2 we choose the value of s to be 1,
otherwise we choose the value of s to be −1. We can then generate a random
variable y by assigning y = sx. Clearly, x and y are not independent, because x
completely determines the magnitude of y. However, Cov(x, y) = 0.
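A quick simulation of this example (the sample size is an arbitrary choice) shows a sample covariance near zero even though y is completely determined by x and s:

import numpy as np

rng = np.random.default_rng(0)
n = 100_000

x = rng.uniform(-1.0, 1.0, size=n)       # x sampled uniformly from [-1, 1]
s = rng.choice([-1.0, 1.0], size=n)      # s is +1 or -1, each with probability 1/2
y = s * x                                # y = s x

print(np.cov(x, y)[0, 1])                          # sample covariance, close to 0
print(np.corrcoef(np.abs(x), np.abs(y))[0, 1])     # 1.0, since |y| = |x|: x and y are dependent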
1.16. INFORMATION THEORY
Specifically,
Likely events should have low information content, and in the extreme case,
events that are guaranteed to happen should have no information content
whatsoever.
Less likely events should have higher information content.
Independent events should have additive information. For example, finding
out that a tossed coin has come up as heads twice should convey twice as
much information as finding out that a tossed coin has come up as heads
once.
In order to satisfy all three of these properties, we define the self-information
of an event x = x to be
I(x) = −log P(x).
Here we always use log to mean the natural logarithm, with base e. Our definition of I(x)
is therefore written in units of nats.
Self-information deals only with a single outcome. We can quantify the amount of
uncertainty in an entire probability distribution using the Shannon entropy:
H(x) = Ex∼P[I(x)] = −Ex∼P[log P(x)],
also denoted H(P ). In other words, the Shannon entropy of a distribution is the
expected amount of information in an event drawn from that distribution. It gives a
lower bound on the number of bits (if the logarithm is base 2, otherwise the units are
different) needed on average to encode symbols drawn from a distribution P.
Distributions that are nearly deterministic (where the outcome is nearly certain)
have low entropy; distributions that are closer to uniform have high entropy. When x
is continuous, the Shannon entropy is known as the differential entropy.
If we have two separate probability distributions P (x) and Q (x) over the same
random variable x, we can measure how different these two distributions are using
the Kullback-Leibler (KL) divergence:
D_KL(P || Q) = Ex∼P[log (P(x) / Q(x))] = Ex∼P[log P(x) − log Q(x)].
Figure: This plot shows how distributions that are closer to deterministic have low
Shannon entropy while distributions that are close to uniform have high Shannon
entropy. On the horizontal axis, we plot p, the probability of a binary random
variable being equal to 1. The entropy is given by (p − 1) log(1 − p) − p log p. When p
is near 0, the distribution is nearly deterministic, because the random variable is
nearly always 0. When p is near 1, the distribution is nearly deterministic, because
the random variable is nearly always 1. When p = 0.5, the entropy is maximal,
because the distribution is uniform over the two outcomes.
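A minimal sketch of self-information, Shannon entropy and KL divergence in nats (the helper functions and the example distributions are illustrative choices, not a standard library API):

import numpy as np

def self_information(p):
    # I(x) = -log P(x), in nats (natural logarithm)
    return -np.log(p)

def entropy(p):
    # Shannon entropy H(P) = -sum over x of P(x) log P(x), in nats
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                  # states with zero probability contribute nothing
    return -np.sum(p * np.log(p))

def kl_divergence(p, q):
    # D_KL(P || Q) = sum over x of P(x) (log P(x) - log Q(x)), in nats
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * (np.log(p[mask]) - np.log(q[mask])))

print(self_information(0.5))          # a fair coin flip carries about 0.693 nats
print(entropy([0.5, 0.5]))            # maximal entropy for a binary variable
print(entropy([0.99, 0.01]))          # nearly deterministic, so low entropy

P, Q = [0.4, 0.6], [0.9, 0.1]         # two arbitrary example distributions
print(kl_divergence(P, Q), kl_divergence(Q, P))   # non-negative and not symmetric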
OVERFLOW AND UNDERFLOW
Underflow occurs when numbers near zero are rounded to zero; overflow occurs when
numbers with large magnitude are approximated as ∞ or −∞. One example of a function
that must be stabilized against underflow and overflow is the softmax function.
The softmax function is often used to predict the probabilities associated with a
multinoulli distribution:
softmax(x)_i = exp(x_i) / Σ_j exp(x_j).
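A common way to stabilize the softmax is to evaluate softmax(z) with z = x − max_i x_i; this leaves the result mathematically unchanged while preventing exp from overflowing. A minimal sketch (the input values are arbitrary):

import numpy as np

def softmax(x):
    # Subtracting max(x) does not change the result but keeps exp() from overflowing
    z = x - np.max(x)
    e = np.exp(z)
    return e / np.sum(e)

x = np.array([1000.0, 1001.0, 1002.0])   # naive exp(x) would overflow here
print(softmax(x))                        # approximately [0.090 0.245 0.665]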
Poor Conditioning
Gradient-Based Optimization
Optimization refers to the task of either minimizing or maximizing some function f(x)
by altering x. We usually phrase optimization problems in terms of minimizing f(x);
maximization may be accomplished via a minimization algorithm by minimizing −f(x).
The function we want to minimize or maximize is called the objective function or
criterion. When we are minimizing it, we may also call it the cost function, loss
function, or error function. We use these terms interchangeably,
though some machine learning publications assign special meaning to some of
these terms.
We often denote the value that minimizes or maximizes a function with a
superscript ∗. For example, we might say x∗ = arg min f(x).
Figure: An illustration of how the gradient descent algorithm uses the derivatives of a
function to follow the function downhill to a minimum.
We assume the reader is already familiar with calculus, but provide a brief review
of how calculus concepts relate to optimization here.
Suppose we have a function y = f(x), where both x and y are real numbers.
The derivative of this function is denoted as f′(x) or as dy/dx. The derivative f′(x)
gives the slope of f(x) at the point x. In other words, it specifies how to
scale a small change ε in the input in order to obtain the corresponding change
in the output:
f(x + ε) ≈ f(x) + ε f′(x).
See the following figure
A point that obtains the absolute lowest value of f (x) is a global minimum. It is
possible for there to be only one global minimum or multiple global minima of the
function. It is also possible for there to be local minima that are not globally
optimal. In the context of deep learning, we optimize functions that may have
many local minima that are not optimal, and many saddle points surrounded by
very flat regions. All of this makes optimization very difficult, especially when the
input to the function is multidimensional. We therefore usually settle for finding a
value of f that is very low, but not necessarily minimal in any formal sense. See
figure for an example.
Figure : Optimization algorithms may fail to find a global minimum when there are
multiple local minima or plateaus present. In the context of deep learning, we
generally accept such solutions even though they are not truly minimal, so long
as they correspond to significantly low values of the cost function.
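The derivative tells us how to reduce f(x): moving x in small steps with the sign opposite to the derivative takes the function downhill. A minimal one-dimensional sketch of this gradient descent idea, on the arbitrary example f(x) = (x − 3)^2 with an arbitrary step size and starting point:

# Gradient descent on f(x) = (x - 3)^2, whose derivative is f'(x) = 2 (x - 3)
def f_prime(x):
    return 2.0 * (x - 3.0)

x = 10.0        # arbitrary starting point
eps = 0.1       # step size (learning rate)

for _ in range(100):
    x = x - eps * f_prime(x)    # step against the derivative to move downhill

print(x)        # close to 3.0, the global minimum of f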
Consider the function f(x) = x1^2 − x2^2. Along the axis corresponding to x1,
the function curves upward. This axis is an eigenvector of the Hessian and has a
positive eigenvalue. Along the axis corresponding to x2, the function curves
downward. This direction is an eigenvector of the Hessian with negative
eigenvalue. The name “saddle point” derives from the saddle-like shape of this
function. This is the quintessential example of a function with a saddle point.
In more than one dimension, it is not necessary to have an eigenvalue of 0 in
order to get a saddle point: it is only necessary to have both positive and
negative eigenvalues. We can think of a saddle point with both signs of
eigenvalues as being a local maximum within one cross section and a local
minimum within another cross section.
Constrained Optimization
Sometimes we wish not only to maximize or minimize a function f(x) over all
possible values of x. Instead we may wish to find the maximal or minimal value of f
(x) for values of x in some set S. This is known as constrained optimization. Points x
that lie within the set S are called feasible points in constrained optimization
terminology.
We often wish to find a solution that is small in some sense. A common approach
in such situations is to impose a norm constraint, such as ||x|| ≤ 1.
The Karush-Kuhn-Tucker (KKT) approach provides a very general solution to
constrained optimization. With the KKT approach, we introduce a new function
called the generalized Lagrangian or generalized Lagrange function. Suppose the
feasible set S is described by m equality constraints g(i)(x) = 0 and n inequality
constraints h(j)(x) ≤ 0.
We introduce new variables λi and αj for each constraint; these are called the KKT
multipliers. The generalized Lagrangian is then defined as
L(x, λ, α) = f(x) + Σ_i λi g(i)(x) + Σ_j αj h(j)(x).
We can now solve a constrained minimization problem using unconstrained
optimization of the generalized Lagrangian. Observe that, so long as at least one
feasible point exists and f(x) is not permitted to have the value ∞,
min_x max_λ max_(α, α≥0) L(x, λ, α)
has the same optimal objective function value and set of optimal points x as
min_(x ∈ S) f(x).
This follows because any time the constraints are satisfied,
max_λ max_(α, α≥0) L(x, λ, α) = f(x),
while any time a constraint is violated,
max_λ max_(α, α≥0) L(x, λ, α) = ∞.
To perform constrained maximization, we can construct the generalized Lagrange
function of −f(x) instead, which leads to the analogous optimization problem
min_x max_λ max_(α, α≥0) −f(x) + Σ_i λi g(i)(x) + Σ_j αj h(j)(x).
A simple set of properties describes the optimal points of constrained optimization
problems. These properties are called the Karush-Kuhn-Tucker (KKT) conditions.
They are necessary conditions, but not always sufficient conditions, for a point to be
optimal. The conditions are:
• The gradient of the generalized Lagrangian is zero.
• All constraints on both x and the KKT multipliers are satisfied.
• The inequality constraints exhibit "complementary slackness": α ⊙ h(x) = 0.
Linear Least Squares
Suppose we want to find the value of x that minimizes
f(x) = (1/2) ||Ax − b||_2^2.
There are specialized linear algebra algorithms that can solve this problem
efficiently. However, we can also explore how to solve it using gradient-based
optimization as a simple example of how these techniques work.
First, we need to obtain the gradient:
∇x f(x) = A^T (Ax − b) = A^T Ax − A^T b.
Algorithm: minimize f(x) = (1/2) ||Ax − b||_2^2 with respect to x using gradient
descent, starting from an arbitrary value of x.
Set the step size (ε) and tolerance (δ) to small, positive numbers.
while ||A^T Ax − A^T b||_2 > δ do
    x ← x − ε (A^T Ax − A^T b)
end while
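A direct Python translation of this algorithm (the matrix A, the vector b, the step size and the tolerance are arbitrary example values); the result can be checked against NumPy's least-squares solver:

import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
b = np.array([1.0, 2.0, 3.0])

eps, delta = 0.01, 1e-8        # step size and tolerance: small, positive numbers
x = np.zeros(2)                # arbitrary starting value of x

while np.linalg.norm(A.T @ A @ x - A.T @ b) > delta:
    x = x - eps * (A.T @ A @ x - A.T @ b)

print(x)                                        # gradient descent solution
print(np.linalg.lstsq(A, b, rcond=None)[0])     # reference least-squares solution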
9. Practice Quiz
1. The inverse of the matrix is possible only for
a) Non-singular matrix
b) Singular matrix
c) Zero Matrix
d) Symmetric Matrix
2. Suppose that the price of 2 balls and 1 bat is 100 units. What will be the
representation of this problem in linear algebra in the form of x and y?
a) 2x + y = 100
b) 2x + 2y = 100
c) 2x + y = 200
d) x + y = 100
3. What is the first step in linear algebra?
A. Let's complicate the problem
B. Solve the problem
C. Visualise the problem
D. None Of the above
4. Which of the following is not a type of matrix?
A. Square Matrix
B. Scalar Matrix
C. Trace Matrix
D. Term Matrix
5. The sum of all the diagonal elements of a square matrix is called?
a) Diagonal matrix
b) Trace matrix
c) Both
d) Identity matrix
6. Multiplication of a matrix with a scalar constant is called?
A.Complex multiplication
B. Linear multiplication
C. Scalar multiplication
D. Constant multiplication
7. Which of the following is correct method to solve matrix equations?
A. Row Echelon Form
B. Inverse of a Matrix
C. Both A and B
D. None Of the above
8. ______________ is equal to the maximum number of linearly independent
row vectors in a matrix.
A. Row matrix
B. Rank of a matrix
C. Term matrix
D. Linear matrix
9. Vectors whose direction remains unchanged even after applying linear
transformation with the matrix are called?
A. Eigenvalues
B. Eigenvectors
C. Cofactor matrix
D. Minor of a matrix
10. The concept of Eigen values and vectors is applicable to?
A. Scalar matrix
B. Identity matrix
C. Upper triangular matrix
D. Square matrix
11. Singular matrices are?
A. non-invertible
B. invertible
C. Both non-invertible and invertible
D. None Of the above
A. Singular
B. Eigen vector
C. Eigen value
D. None Of the above
10. Assignments
S.No Question BL CO
1 Discuss the five types of matrices. 6 2
2 Write and explain in detail about principal component analysis with suitable examples. 5 1
3 Explain in detail about Bayes' rule. 5 1
4 Compare expectation, variance and covariance. 2 2
5 Discuss in detail about gradient-based optimization and constrained optimization. 6 2
Ans. When we have to determine the equation of the line of best fit for the given
data, we use the following formulas.
The equation of the least squares line is given by Y = a + bX.
Normal equation for 'a': ∑Y = na + b∑X
Normal equation for 'b': ∑XY = a∑X + b∑X²
S.No Question BL CO
1 Write and explain in detail about principal component analysis with suitable examples. 2 1
2 Explain in detail about Bayes' rule. 2 1
3 Compare expectation, variance and covariance. 2 4
14. Real Time Applications
S.No Application CO
1 Virtual Assistants 1
Virtual Assistants are cloud-based applications that understand natural
language voice commands and complete tasks for the user. Amazon
Alexa, Cortana, Siri, and Google Assistant are typical examples of virtual
assistants. They need internet-connected devices to work with their full
capabilities. Each time a command is fed to the assistant, they tend to
provide a better user experience based on past experiences using Deep
Learning algorithms.
2 Chatbots 1
Chatbots can solve customer problems in seconds. A chatbot is an AI
application to chat online via text or text-to-speech. It is capable of
communicating and performing actions similar to a human. Chatbots are
used a lot in customer interaction, marketing on social network sites, and
instant messaging the client. It delivers automated responses to user
inputs. It uses machine learning and deep learning algorithms to
generate different types of reactions.
The next important deep learning application is related to Healthcare.
3 Healthcare 1
Deep Learning has found its application in the Healthcare sector.
Computer-aided disease detection and computer-aided diagnosis have
been possible using Deep Learning. It is widely used for medical research,
drug discovery, and diagnosis of life-threatening diseases such as cancer
and diabetic retinopathy through the process of medical imaging.
4 Entertainment 1
Companies such as Netflix, Amazon, YouTube, and Spotify give relevant
movies, songs, and video recommendations to enhance their customer
experience. This is all thanks to Deep Learning. Based on a person’s
browsing history, interest, and behavior, online streaming companies give
suggestions to help them make product and service choices. Deep
learning techniques are also used to add sound to silent movies and
generate subtitles automatically.
5 News Aggregation and Fake News Detection 1
Deep Learning allows you to customize news depending on the readers’
persona. You can aggregate and filter out news information as per
social, geographical, and economic parameters and the individual
preferences of a reader. Neural Networks help develop classifiers that
can detect fake and biased news and remove it from your feed. They
also warn you of possible privacy breaches.
6 Image Coloring 1
Image colorization has seen significant advancements using Deep
Learning. Image colorization is taking an input of a grayscale image and
then producing an output of a colorized image. ChromaGAN is an
example of a picture colorization model. A generative network is framed
in an adversarial model that learns to colorize by incorporating a
perceptual and semantic understanding of both class distributions and
color.
To start with deep learning, the very basic project that you can build is to predict
the next digit in a sequence. Create a sequence like a list of odd numbers and then
build a model and train it to predict the next digit in the sequence. A simple neural
network with 2 layers would be sufficient to build the model.
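A minimal sketch of this starter project using Keras (the toy data, the layer sizes and the training settings are arbitrary illustrative choices):

import numpy as np
import tensorflow as tf

# Toy dataset: three consecutive odd numbers -> the next odd number
X = np.array([[1, 3, 5], [3, 5, 7], [5, 7, 9], [7, 9, 11], [9, 11, 13]], dtype=float)
y = np.array([7, 9, 11, 13, 15], dtype=float)

# A simple network with 2 layers, as suggested above
model = tf.keras.Sequential([
    tf.keras.Input(shape=(3,)),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=2000, verbose=0)

# With so little data the prediction is approximate, but it should move toward 17
print(model.predict(np.array([[11.0, 13.0, 15.0]])))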
The face detection took a major leap with deep learning techniques. We can
build models with high accuracy in detecting the bounding boxes of the human
face. This project will get you started with object detection and you will learn how
to detect any object in an image.
How often do you get stuck thinking about the name of a dog’s breed? There are
many dog breeds and most of them are similar to each other. We can use the dog
breeds dataset and build a model that will classify different dog breeds from an
image. This project will be useful for a lot of people.