
SVEC TIRUPATI

COURSE MATERIAL

SUBJECT: DEEP LEARNING (20A05703C)

UNIT: 1

COURSE: B.TECH

DEPARTMENT: COMPUTER SCIENCE & ENGINEERING

SEMESTER: 4-1

PREPARED BY (Faculty Name/s): Mrs. G T PRASANNA KUMARI, Mrs. N. DIVYA

VERSION: V-1

PREPARED / REVISED DATE: 20-08-2023


TABLE OF CONTENTS – UNIT 1

S. NO   CONTENTS
1       COURSE OBJECTIVES
2       PREREQUISITES
3       SYLLABUS
4       COURSE OUTCOMES
5       CO-PO/PSO MAPPING
6       LESSON PLAN
7       ACTIVITY BASED LEARNING
8       LECTURE NOTES
        1.1  INTRODUCTION TO LINEAR ALGEBRA
        1.2  SCALARS, VECTORS, MATRICES, TENSORS
        1.3  MATRIX OPERATIONS
        1.4  TYPES OF MATRICES
        1.5  NORMS
        1.6  EIGENDECOMPOSITION
        1.7  SINGULAR VALUE DECOMPOSITION
        1.8  PRINCIPAL COMPONENT ANALYSIS
        1.9  PROBABILITY AND INFORMATION THEORY
        1.10 RANDOM VARIABLES
        1.11 PROBABILITY DISTRIBUTIONS
        1.12 MARGINAL PROBABILITY
        1.13 CONDITIONAL PROBABILITY
        1.14 EXPECTATION, VARIANCE AND COVARIANCE
        1.15 BAYES' RULE
        1.16 INFORMATION THEORY
        1.17 NUMERICAL COMPUTATION: UNDERFLOW AND OVERFLOW
        1.18 GRADIENT-BASED OPTIMIZATION
        1.19 CONSTRAINT-BASED OPTIMIZATION
        1.20 LINEAR LEAST SQUARES
9       PRACTICE QUIZ
10      ASSIGNMENTS
11      PART A QUESTIONS & ANSWERS (2 MARKS QUESTIONS)
12      PART B QUESTIONS
13      SUPPORTIVE ONLINE CERTIFICATION COURSES
14      REAL TIME APPLICATIONS
15      CONTENTS BEYOND THE SYLLABUS
16      PRESCRIBED TEXT BOOKS & REFERENCE BOOKS
17      MINI PROJECT SUGGESTION

1. Course Objectives
The objectives of this course are to
1. Demonstrate the major technology trends driving Deep Learning.
2. Build, train and apply fully connected neural networks.
3. Implement efficient neural networks.
4. Analyze the key parameters and hyperparameters in a neural network's architecture.
5. Apply concepts of Deep Learning to solve real-world problems.

2. Prerequisites
This course is intended for senior undergraduate and junior graduate students
who have a proper understanding of
 Python Programming Language
 Calculus
 Linear Algebra
 Probability Theory
Although it would be helpful, knowledge about classical machine learning is NOT
required.
3. Syllabus
UNIT I
Introduction to Linear Algebra: Scalars, Vectors, Matrices and Tensors, Matrix
Operations, Types of Matrices, Norms, Eigendecomposition, Singular Value
Decomposition, Principal Component Analysis.
Probability and Information Theory: Random Variables, Probability Distribution,
Marginal Distribution, Conditional Probability, Expectation, Variance and
Covariance, Bayes' Rule, Information Theory.
Numerical Computation: Overflow and Underflow, Gradient-Based Optimization,
Constraint-Based Optimization, Linear Least Squares.

4. Course outcomes
1. Demonstrate the mathematical foundations of neural networks.
2. Describe the machine learning basics.
3. Differentiate architectures of deep neural networks.
4. Build convolutional neural networks.
5. Build and train RNNs and LSTMs.

5. CO-PO / PSO Mapping

Deep Learning: PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12 PSO1 PSO2

CO1: 3 2

CO2: 3 2

CO3: 3 3 2 2 3 2 2

CO4: 3 3 2 2 3 2 2

CO5:

6. Lesson Plan

Lecture No. | Week | Topics to be covered | References
1  | 1 | Introduction to Linear Algebra: Scalars, Vectors, Matrices and Tensors | T1
2  | 1 | Matrix Operations, Types of Matrices, Norms | T1, R1
3  | 1 | Eigendecomposition, Singular Value Decomposition | T1, R1
4  | 1 | Principal Component Analysis | T1, R1
5  | 2 | Probability and Information Theory: Random Variables, Probability Distribution | T1, R1
6  | 2 | Marginal Distribution, Conditional Probability | T1, R1
7  | 2 | Expectations, Variance and Covariance | T1, R1
8  | 2 | Bayes' Rule, Information Theory | T1, R1
9  | 3 | Numerical Computation: Overflow and Underflow | T1, R1
10 | 3 | Gradient-Based Optimization, Constraint-Based Optimization | T1, R1
11 | 3 | Linear Least Squares | T1, R1

7. Activity Based Learning

1. The DL course is associated with a laboratory; different open-ended problem
statements are given to each student to carry out experiments using the Google
Colab tool. You will study the foundations of Deep Learning, understand how to
build neural networks, and learn how to lead successful machine learning
projects. You will learn about convolutional networks, RNNs, LSTMs, etc.

2. You will work on case studies from healthcare, autonomous driving, sign
language reading, music generation, and natural language processing. You will
master not only the theory but also see how it is applied in industry.
8. Lecture Notes

1.1 INTRODUCTION TO LINEAR ALGEBRA


Introduction: Linear algebra is a branch of mathematics that is widely used
throughout science and engineering. However, because linear algebra is a form
of continuous rather than discrete mathematics, many computer scientists have
little experience with it. A good understanding of linear algebra is essential for
understanding and working with many machine learning algorithms, especially
deep learning algorithms.

1.2 Scalars, Vectors, Matrices and Tensors


The study of linear algebra involves several types of mathematical objects:

• Scalars: A scalar is just a single number, in contrast to most of the other objects
studied in linear algebra, which are usually arrays of multiple numbers. We write
scalars in italics. We usually give scalars lower-case variable names. When we
introduce them, we specify what kind of number they are.

For example, we might say “Let s ∈ R be the slope of the line,” while defining a
real-valued scalar, or “Let n ∈ N be the number of units,” while defining a natural
number scalar.

• Vectors: A vector is an array of numbers. The numbers are arranged in order.


We can identify each individual number by its index in that ordering. Typically
we give vectors lower case names written in bold typeface, such as x. The
elements of the vector are identified by writing its name in italic typeface, with a
subscript. The first element of x is x1, the second element is x2 and so on. We also
need to say what kind of numbers are stored in the vector. If each element is in
R, and the vector has n elements, then the vector lies in the set formed by taking
the Cartesian product of R n times, denoted as Rn. When we need to explicitly
identify the elements of a vector, we write them as a column enclosed in square
brackets:

x = [ x1
      x2
      ...
      xn ]

We can think of vectors as identifying points in space, with each element giving
the coordinate along a different axis.

Sometimes we need to index a set of elements of a vector. In this case, we


define a set containing the indices and write the set as a subscript. For example,
to access x1, x3 and x6, we define the set S = {1, 3, 6} and write xS.

We use the − sign to index the complement of a set. For example x−1 is the
vector containing all elements of x except for x1, and x−S is the vector
containing all of the elements of x except for x1, x3 and x6.

• Matrices: A matrix is a 2-D array of numbers, so each element is identified by


two indices instead of just one. We usually give matrices upper-case variable
names with bold typeface, such as A. If a real-valued matrix A has a height of m
and a width of n, then we say that A ∈ R m×n. We usually identify the elements of
a matrix using its name in italic but not bold font, and the indices are listed with
separating commas. For example, A1,1 is the upper left entry of A and Am,n is
the bottom right entry. We can identify all of the numbers with vertical
coordinate i by writing a “:” for the horizontal coordinate. For example, Ai,:
denotes the horizontal cross section of A with vertical coordinate i. This is known
as the i-th row of A. Likewise, A:,i is the i-th column of A.

A = [ A1,1  A1,2
      A2,1  A2,2
      A3,1  A3,2 ]

A^T = [ A1,1  A2,1  A3,1
        A1,2  A2,2  A3,2 ]

Figure: The transpose of the matrix can be thought of as a mirror image across
the main diagonal.

When we need to explicitly identify the elements of a matrix, we write them as
an array enclosed in square brackets:

[ A1,1  A1,2
  A2,1  A2,2 ]

Sometimes we may need to index matrix-valued expressions that are not just a
single letter. In this case, we use subscripts after the expression, but do not
convert anything to lower case. For example, f (A)i,j gives element (i, j) of the
matrix computed by applying the function f to A.
Tensors: In some cases we will need an array with more than two axes. In the
general case, an array of numbers arranged on a regular grid with a variable
number of axes is known as a tensor. We denote a tensor named “A” with this
typeface: A. We identify the element of A at coordinates (i, j, k) by writing Ai,j,k.
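These four kinds of objects map directly onto NumPy arrays; a minimal sketch (NumPy itself is an assumption here, though it is preinstalled in the Google Colab environment used for the lab activities):

```python
import numpy as np

s = 3.5                           # scalar: a single number
x = np.array([1.0, 2.0, 3.0])     # vector: 1-D array; x[0] is its first element
A = np.array([[1, 2],
              [3, 4],
              [5, 6]])            # matrix: 2-D array; shape (3, 2) means A is 3 x 2
T = np.zeros((2, 3, 4))           # tensor: array with more than two axes

print(A.shape)      # (3, 2)
print(A.T.shape)    # (2, 3): the transpose mirrors across the main diagonal
print(A[0, :])      # first row, A0,:
print(A[:, 0])      # first column, A:,0
```

Indexing with `A[i, :]` and `A[:, j]` corresponds directly to the Ai,: (row) and A:,j (column) notation above.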

1.3 MATRIX OPERATIONS

Operations on Matrices
Addition, subtraction and multiplication are the basic operations on matrices.
To add or subtract matrices, they must be of identical order; for
multiplication, the number of columns in the first matrix must equal the number
of rows in the second matrix.
 Addition of Matrices
 Subtraction of Matrices
 Scalar Multiplication of Matrices
 Multiplication of Matrices

Addition of Matrices
If A[aij]mxn and B[bij]mxn are two matrices of the same order, then their sum A + B is
a matrix, and each element of that matrix is the sum of the corresponding
elements. i.e. A + B = [aij + bij]mxn
Consider the two matrices A & B of order 2 x 2. Then the sum is given by:
[ a1  b1 ]   [ a2  b2 ]   [ a1+a2  b1+b2 ]
[ c1  d1 ] + [ c2  d2 ] = [ c1+c2  d1+d2 ]
Properties of Matrix Addition: If A, B and C are matrices of the same order, then
(a) Commutative Law: A + B = B + A
(b) Associative Law: (A + B) + C = A + (B + C)
(c) Identity of the Matrix: A + O = O + A = A, where O is the zero matrix, which is the
additive identity of the matrix.
(d) Additive Inverse: A + (−A) = O = (−A) + A, where (−A), obtained by changing the
sign of every element of A, is the additive inverse of the matrix.
(e) Cancellation Law: A + B = A + C ⇒ B = C
(f) tr(A ± B) = tr(A) ± tr(B)
(g) If A + B = O = B + A, then B is called the additive inverse of A, and A is called
the additive inverse of B.
Subtraction of Matrices
If A and B are two matrices of the same order, then we define A−B=A+(−B).
Consider the two matrices A & B of order 2 x 2. Then the difference is given by:
[ a1  b1 ]   [ a2  b2 ]   [ a1−a2  b1−b2 ]
[ c1  d1 ] − [ c2  d2 ] = [ c1−c2  d1−d2 ]
We can subtract the matrices by subtracting each element of one matrix from
the corresponding element of the second matrix. i.e. A – B = [aij – bij]mxn
Scalar Multiplication of Matrices
If A = [aij]m×n is a matrix and k is any number, then the matrix obtained by
multiplying each element of A by k is called the scalar multiple of A by k,
and it is denoted by kA. Thus if A = [aij]m×n, then kAm×n = Am×nk = [kaij]m×n.
Properties of Scalar Multiplication: If A, B are matrices of the same order and λ
and μ are any two scalars then;
(a) λ(A + B) = λA + λB
(b) (λ + μ)A = λA + μA
(c) λ(μA) = (λμ)A = μ(λA)
(d) (−λ)A = −(λA) = λ(−A)
(e) tr(kA) = k tr(A)
Multiplication of Matrices
If A and B are any two matrices, then their product AB is defined only when
the number of columns in A is equal to the number of rows in B.
If A = [aij]m×n and B = [bij]n×p, then their product AB = C = [cij]m×p is a matrix
of order m×p, where

(AB)ij = cij = Σ (r = 1 to n) air brj
Properties of matrix multiplication
(a) Matrix multiplication is not commutative in general, i.e. in general AB ≠ BA.
(b) Matrix multiplication is associative, i.e. (AB)C = A(BC).
(c) Matrix multiplication is distributive over matrix addition, i.e. A(B + C) = AB +
AC and (A + B)C = AC + BC.
(d) If A is an m × n matrix, then Im A = A = A In.
(e) The product of two matrices can be a null matrix while neither of them is null,
i.e. AB = 0 does not imply that A = 0 or B = 0.
(f) If A is an m × n matrix and O is a null matrix, then Am×n On×p = Om×p, i.e. the
product of a matrix with a null matrix is always a null matrix.
(g) AB = AC does not imply B = C (the cancellation law does not hold for
multiplication).
(h) tr(AB) = tr(BA)
(i) There exists a multiplicative identity I for every square matrix A such that AI = IA = A.
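Several of these properties can be verified numerically; a minimal NumPy sketch (the 2×2 matrices below are arbitrary examples):

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[0, 1], [1, 0]])
C = np.array([[2, 0], [0, 2]])

print(A + B)                                   # elementwise addition (same order required)
print(np.allclose(A + B, B + A))               # True: addition is commutative
print(np.allclose(A @ B, B @ A))               # False: multiplication is not, in general
print(np.allclose((A @ B) @ C, A @ (B @ C)))   # True: multiplication is associative
I = np.eye(2)
print(np.allclose(A @ I, A))                   # True: I is the multiplicative identity
print(np.trace(A @ B) == np.trace(B @ A))      # True: tr(AB) = tr(BA)
```

The `@` operator is NumPy's matrix product; `*` would instead multiply elementwise, which is a common source of bugs.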

1.4 TYPES OF MATRICES:

This section is divided into six parts covering the main types of matrices; they are:

1. Square Matrix
2. Symmetric Matrix
3. Triangular Matrix
4. Diagonal Matrix
5. Identity Matrix
6. Orthogonal Matrix
Square Matrix
A square matrix is a matrix where the number of rows (n) equals the number of columns
(m).

n=m

The square matrix is contrasted with the rectangular matrix where the number of
rows and columns are not equal.
Given that the number of rows and columns match, the dimensions are usually
denoted as n, e.g. n x n. The size of the matrix is called the order, so an order 4
square matrix is 4 x 4.

The vector of values along the diagonal of the matrix from the top left to the
bottom right is called the main diagonal.
Below is an example of an order 3 square matrix.

    1, 2, 3
M = (1, 2, 3)
    1, 2, 3

Square matrices are readily added and multiplied together and are the basis of
many simple linear transformations, such as rotations (as in the rotations of
images).
Symmetric Matrix
A symmetric matrix is a type of square matrix where the top-right triangle is the
same as the bottom-left triangle.
To be symmetric, the axis of symmetry is always the main diagonal of the matrix,
from the top left to the bottom right.
Below is an example of a 5×5 symmetric matrix.
1, 2, 3, 4, 5
2, 1, 2, 3, 4
M = (3, 2, 1, 2, 3)
4, 3, 2, 1, 2
5, 4, 3, 2, 1

A symmetric matrix is always square and equal to its own transpose. M = M^T
Triangular Matrix
A triangular matrix is a type of square matrix that has all values in the upper-right or
lower-left of the matrix with the remaining elements filled with zero values.
A triangular matrix with values only above the main diagonal is called an upper
triangular matrix. Whereas, a triangular matrix with values only below the main
diagonal is called a lower triangular matrix.

Below is an example of a 3×3 upper triangular matrix.


1, 2, 3
M = ( 0, 2, 3)
0, 0, 3
Below is an example of a 3×3 lower triangular matrix.
1, 0, 0
M = (1, 2, 0)
1, 2, 3

Diagonal Matrix
A diagonal matrix is one where values outside of the main diagonal have a zero
value, where the main diagonal is taken from the top left of the matrix to the
bottom right.
A diagonal matrix is often denoted with the variable D and may be represented as
a full matrix or as a vector of values on the main diagonal.
Below is an example of a 3×3 square diagonal matrix.
1, 0, 0
D = (0, 2, 0)
0, 0, 3
As a vector, it would be represented as: d = (1, 2, 3)

A diagonal matrix does not have to be square. In the case of a rectangular


matrix, the diagonal would cover the shortest dimension;
for example:
1, 0, 0, 0
0, 2, 0, 0
D= ( 0, 0, 3, 0)
0, 0, 0, 4
0, 0, 0, 0

Identity Matrix
An identity matrix is a square matrix that does not change a vector when
multiplied.
The values of an identity matrix are known. All of the scalar values along the
main diagonal (top-left to bottom-right) have the value one, while all other
values are zero.
For example, an identity matrix with the size 3 or I3 would be as follows:

1, 0, 0
I = (0, 1, 0)
0, 0, 1
Orthogonal Matrix
Two vectors are orthogonal when their dot product equals zero; if they are also
of unit length they are called orthonormal.
v . w = 0 or v w^T = 0
An orthogonal matrix is a square matrix whose rows and columns are mutually
orthonormal, so that
Q^T Q = Q Q^T = I, which means Q^T = Q^−1.
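These type checks are easy to express in NumPy; a minimal sketch (the matrices chosen here are arbitrary examples, and the rotation matrix illustrates orthogonality):

```python
import numpy as np

M = np.array([[1, 2, 3],
              [2, 1, 2],
              [3, 2, 1]])
print(np.allclose(M, M.T))              # True: M equals its transpose, so it is symmetric

D = np.diag([1, 2, 3])                  # square diagonal matrix from its diagonal vector d
print(np.allclose(np.triu(D), D))       # True: a diagonal matrix is also (upper) triangular

# A 2-D rotation matrix is a classic orthogonal matrix
theta = np.pi / 4
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
print(np.allclose(Q.T @ Q, np.eye(2)))      # True: Q^T Q = I
print(np.allclose(Q.T, np.linalg.inv(Q)))   # True: Q^T = Q^-1
```

The last property is what makes orthogonal matrices attractive computationally: their inverse is just a transpose.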

1.5 NORMS

To understand what the norm of a vector is, let us recall that a vector is an
ordered, finite list of numbers, for example a two-element vector x = (x1, x2).

Because the vector x in this example has two elements, we can easily plot it in
a 2D plane: the first element of the vector corresponds to the x-value and the
second element corresponds to the y-value.

Besides its elements, a vector is further characterized by its norm, which is
the distance of the vector from the origin at x, y = 0, and by its angle. The
(L2) norm is calculated like this:

||x||2 = sqrt(x1^2 + x2^2)

More generally, the Lp norm of a vector is ||x||p = (Σi |xi|^p)^(1/p); the L1
norm Σi |xi| and the max norm ||x||∞ = maxi |xi| are also commonly used in
machine learning.
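Assuming NumPy, these norms can all be computed with `numpy.linalg.norm` (the vector [3, 4] is an arbitrary example chosen so the L2 norm comes out to a whole number):

```python
import numpy as np

x = np.array([3.0, 4.0])
print(np.linalg.norm(x))          # 5.0 -- L2 norm: sqrt(3^2 + 4^2)
print(np.linalg.norm(x, 1))       # 7.0 -- L1 norm: |3| + |4|
print(np.linalg.norm(x, np.inf))  # 4.0 -- max norm: largest absolute element
```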

1.6. EIGENDECOMPOSITION

An eigenvector of a square matrix A is a non-zero vector v such that
multiplication by A alters only the scale of v:

Av = λv

The scalar λ is known as the eigenvalue corresponding to this eigenvector. (One
can also find a left eigenvector such that v^T A = λv^T, but we are usually
concerned with right eigenvectors.) If v is an eigenvector of A, then so is any
rescaled vector sv for s ∈ R, s ≠ 0. Moreover, sv still has the same eigenvalue.
For this reason, we usually only look for unit eigenvectors. Suppose that a
matrix A has n linearly independent eigenvectors, {v(1), . . . , v(n)}, with
corresponding eigenvalues {λ1, . . . , λn}. We may concatenate the eigenvectors
into a matrix V and the eigenvalues into a vector λ, giving the
eigendecomposition of A as A = V diag(λ) V^−1.

The eigendecomposition of a real symmetric matrix can also be used to
optimize quadratic expressions of the form f(x) = x^T Ax subject to ||x||2 = 1.
Whenever x is equal to an eigenvector of A, f takes on the value of the
corresponding eigenvalue. The maximum value of f within the constraint region
is the maximum eigenvalue and its minimum value within the constraint region is
the minimum eigenvalue.
A matrix whose eigenvalues are all positive is called positive definite. A matrix
whose eigenvalues are all positive or zero-valued is called positive semidefinite.
Likewise, if all eigenvalues are negative, the matrix is negative definite, and
if all eigenvalues are negative or zero-valued, it is negative semidefinite. Positive
semidefinite matrices are interesting because they guarantee that ∀x, x^T Ax ≥ 0.
Positive definite matrices additionally guarantee that x^T Ax = 0 ⇒ x = 0.
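These facts can be checked numerically; a minimal sketch using `numpy.linalg.eigh`, which is the NumPy routine for symmetric matrices (the 2×2 matrix below is an arbitrary symmetric example):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])            # real symmetric matrix

eigvals, V = np.linalg.eigh(A)        # eigh is for symmetric/Hermitian matrices
print(eigvals)                        # [1. 3.]: all positive, so A is positive definite

# For symmetric A, V is orthogonal, so A = V diag(lambda) V^T
A_rebuilt = V @ np.diag(eigvals) @ V.T
print(np.allclose(A, A_rebuilt))      # True

# The defining property Av = lambda v holds for each eigenvector
v = V[:, 0]
print(np.allclose(A @ v, eigvals[0] * v))  # True
```

For a general (non-symmetric) square matrix, `numpy.linalg.eig` plays the same role and the reconstruction uses V^−1 in place of V^T.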
1.7. SINGULAR VALUE DECOMPOSITION
The singular value decomposition (SVD) provides another way to factorize a
matrix, into singular vectors and singular values. The SVD allows us to discover
some of the same kind of information as the eigendecomposition.
Every real matrix has a singular value decomposition, but the same is not true of
the eigenvalue decomposition. For example, if a matrix is not square, the
eigendecomposition is not defined, and we must use a singular value
decomposition instead. Recall that the eigendecomposition involves analyzing
a matrix A to discover a matrix V of eigenvectors and a vector of eigenvalues λ
such that we can rewrite A as A = V diag(λ) V^−1.
The singular value decomposition is similar, except this time we will write A as a
product of three matrices: A = U D V^T.
Suppose that A is an m ×n matrix. Then U is defined to be an m ×m matrix, D to
be an m × n matrix, and V to be an n × n matrix. Each of these matrices is
defined to have a special structure. The matrices U and V are both defined to
be orthogonal matrices. The matrix D is defined to be a diagonal matrix. Note
that D is not necessarily square. The elements along the diagonal of D are
known as the singular values of the matrix A. The columns of U are known as the
left-singular vectors. The columns of V are known as the right-singular vectors.

We can actually interpret the singular value decomposition of A in terms of the
eigendecomposition of functions of A. The left-singular vectors of A are the
eigenvectors of AA^T. The right-singular vectors of A are the eigenvectors of
A^T A. The non-zero singular values of A are the square roots of the eigenvalues
of A^T A. The same is true for AA^T.
The eigendecomposition of a matrix tells us many useful facts about the matrix.
The matrix is singular if and only if any of the eigenvalues are zero.
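These relationships can be verified with `numpy.linalg.svd`; a minimal sketch (the 2×3 matrix below is an arbitrary non-square example, precisely the case where the eigendecomposition is not defined):

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])       # m = 2, n = 3: not square

U, s, Vt = np.linalg.svd(A)           # A = U D V^T; s holds the singular values
print(U.shape, s.shape, Vt.shape)     # (2, 2) (2,) (3, 3)

# Rebuild A: D is 2 x 3 with the singular values on its diagonal
D = np.zeros_like(A)
D[:2, :2] = np.diag(s)
print(np.allclose(A, U @ D @ Vt))     # True

# The non-zero singular values are square roots of the eigenvalues of A A^T
eigvals = np.linalg.eigvalsh(A @ A.T)
print(np.allclose(sorted(s**2), sorted(eigvals)))  # True
```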

1.8. PRINCIPAL COMPONENTS ANALYSIS


One simple machine learning algorithm, principal components analysis or PCA
can be derived using only knowledge of basic linear algebra.

Suppose we have a collection of m points {x(1), . . . , x(m)} in Rn. Suppose we


would like to apply lossy compression to these points. Lossy compression means
storing the points in a way that requires less memory but may lose some
precision. We would like to lose as little precision as possible.

One way we can encode these points is to represent a lower-dimensional


version of them. For each point x(i) ∈ Rn we will find a corresponding code vector
c(i) ∈ Rl. If l is smaller than n, it will take less memory to store the code points than
the original data. We will want to find some encoding function that produces
the code for an input, f(x) = c, and a decoding function that produces the
reconstructed input given its code, x ≈ g(f (x)).

PCA is defined by our choice of the decoding function. Specifically, to make


the decoder very simple, we choose to use matrix multiplication to map the
code back

into Rn. Let g(c) = Dc, where D ∈ Rn×l is the matrix defining the decoding.

Computing the optimal code for this decoder could be a difficult problem. To
keep the encoding problem easy, PCA constrains the columns of D to be
orthogonal to each other.

To give the problem a unique solution, we constrain all of the columns of D to


have unit norm.

In order to turn this basic idea into an algorithm we can implement, the first thing
we need to do is figure out how to generate the optimal code point c∗ for each
input point x. One way to do this is to minimize the distance between the input

point x and its reconstruction, g(c*). We can measure this distance using a
norm. In the principal components algorithm, we use the L2 norm:

c* = arg min_c ||x − g(c)||2

We can switch to the squared L2 norm instead of the L2 norm itself, because
both are minimized by the same value of c. Both are minimized by the same
value of c because the L2 norm is non-negative and the squaring operation is
monotonically increasing for non-negative arguments:

c* = arg min_c ||x − g(c)||2^2

The function being minimized simplifies to

(x − g(c))^T (x − g(c))

(by the definition of the squared L2 norm)

= x^T x − x^T g(c) − g(c)^T x + g(c)^T g(c)

(by the distributive property)

= x^T x − 2 x^T g(c) + g(c)^T g(c)

(because the scalar g(c)^T x is equal to the transpose of itself).

We can now change the function being minimized again, to omit the first term,
since this term does not depend on c:

c* = arg min_c −2 x^T Dc + c^T D^T Dc

= arg min_c −2 x^T Dc + c^T Il c

(by the orthogonality and unit norm constraints on D)

= arg min_c −2 x^T Dc + c^T c

We can solve this optimization problem using vector calculus:

∇c (−2 x^T Dc + c^T c) = 0

−2 D^T x + 2c = 0

c = D^T x

This makes the algorithm efficient: we can optimally encode x using just a
matrix-vector operation. To encode a vector, we apply the encoder function

f(x) = D^T x

Using a further matrix multiplication, we can also define the PCA reconstruction
operation:

r(x) = g(f(x)) = D D^T x

Next, we need to choose the encoding matrix D. To do so, we revisit the idea of
minimizing the L2 distance between inputs and reconstructions.
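The encode/reconstruct steps derived above can be sketched in NumPy, taking D from the SVD of the centered data matrix (which yields the principal directions); the data here is synthetic, constructed to lie almost exactly in a 2-D subspace of R^3:

```python
import numpy as np

rng = np.random.default_rng(0)
# 200 points in R^3 that vary only along two directions (rank-2 data)
X = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 3))

Xc = X - X.mean(axis=0)                   # center the data first
# rows of Vt are the principal directions, ordered by singular value
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

l = 2                                     # keep l = 2 components
D = Vt[:l].T                              # n x l matrix with orthonormal columns
codes = Xc @ D                            # encoder f(x) = D^T x, applied row-wise
recon = codes @ D.T                       # reconstruction r(x) = D D^T x

print(np.allclose(D.T @ D, np.eye(l)))    # True: D satisfies D^T D = I_l
err = np.linalg.norm(Xc - recon)
print(err)                                # near 0, since the data is (almost) rank 2
```

Because the data was built from two directions, keeping l = 2 components reconstructs it essentially perfectly; on real data the reconstruction error measures how much variance the discarded components carried.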

1.9. PROBABILITY AND INFORMATION THEORY

Probability theory is a mathematical framework for representing uncertain statements.


It provides a means of quantifying uncertainty and axioms for deriving new uncertain
statements. In artificial intelligence applications, we use probability theory in two
major ways. First, the laws of probability tell us how AI systems should reason, so we
design our algorithms to compute or approximate various expressions derived using
probability theory. Second, we can use probability and statistics to theoretically
analyze the behavior of proposed AI systems.

While probability theory allows us to make uncertain statements and reason in the
presence of uncertainty, information theory allows us to quantify the amount of
uncertainty in a probability distribution.

Why Probability?

There are three possible sources of uncertainty:

1. Inherent stochasticity in the system being modeled. For example, most


interpretations of quantum mechanics describe the dynamics of subatomic
particles as being probabilistic. We can also create theoretical scenarios that we
postulate to have random dynamics, such as a hypothetical card game where we
assume that the cards are truly shuffled into a random order.

2. Incomplete observability. Even deterministic systems can appear stochastic when


we cannot observe all of the variables that drive the behavior of the system. For
example, in the Monty Hall problem, a game show contestant is asked to choose
between three doors and wins a prize held behind the chosen door. Two doors
lead to a goat while a third leads to a car. The outcome given the contestant’s
choice is deterministic, but from the contestant’s point of view, the outcome is
uncertain.

3. Incomplete modeling. When we use a model that must discard some of the
information we have observed, the discarded information results in uncertainty in
the model’s predictions. For example, suppose we build a robot that can exactly
observe the location of every object around it. If the robot discretizes space when
predicting the future location of these objects, then the discretization makes the
robot immediately become uncertain about the precise position of objects: each
object could be anywhere within the discrete cell that it was observed to occupy.

Probability theory was originally developed to analyze the frequencies of events. It


is easy to see how probability theory can be used to study events like drawing a
certain hand of cards in a game of poker. These kinds of events are often
repeatable. When we say that an outcome has a probability p of occurring, it
means that if we repeated the experiment (e.g., draw a hand of cards) infinitely
many times, then proportion p of the repetitions would result in that outcome. This
kind of reasoning does not seem immediately applicable to propositions that are
not repeatable.

1.10. RANDOM VARIABLES

A random variable is a variable that can take on different values randomly. We


typically denote the random variable itself with a lower case letter in plain typeface,
and the values it can take on with lower case script letters. For example, x1 and x2
are both possible values that the random variable x can take on. For vector-valued
variables, we would write the random variable as x and one of its values as x. On its
own, a random variable is just a description of the states that are possible; it must be
coupled with a probability distribution that specifies how likely each of these states
are.

Random variables may be discrete or continuous. A discrete random variable is one
that has a finite or countably infinite number of states. Note that these states are not
necessarily the integers; they can also just be named states that are not considered
to have any numerical value. A continuous random variable is associated with a real
value.
1.11. PROBABILITY DISTRIBUTIONS
A probability distribution is a description of how likely a random variable or set of
random variables is to take on each of its possible states. The way we describe
probability distributions depends on whether the variables are discrete or continuous.

• Discrete Variables and Probability Mass Functions

A probability distribution over discrete variables may be described using a
probability mass function (PMF). We typically denote probability mass functions
with a capital P. Often we associate each random variable with a different
probability mass function and the reader must infer which probability mass
function to use based on the identity of the random variable, rather than the
name of the function; P(x) is usually not the same as P(y).

The probability mass function maps from a state of a random variable to the
probability of that random variable taking on that state. The probability that x = x is
denoted as P (x), with a probability of 1 indicating that x = x is certain and a
probability of 0 indicating that x = x is impossible. Sometimes to disambiguate which
PMF to use, we write the name of the random variable explicitly: P (x = x). Sometimes
we define a variable first, then use ∼ notation to specify which distribution it follows
later: x ∼ P (x).

Probability mass functions can act on many variables at the same time. Such a
probability distribution over many variables is known as a joint probability distribution.
P (x = x, y = y ) denotes the probability that x = x and y = y simultaneously. We may
also write P (x, y) for brevity. To be a probability mass function on a random variable
x, a function P must satisfy the following properties:

• The domain of P must be the set of all possible states of x.

• ∀x ∈ x, 0 ≤ P(x) ≤ 1. An impossible event has probability 0, and no state can
be less probable than that. Likewise, an event that is guaranteed to happen has
probability 1, and no state can have a greater chance of occurring.

• Σx∈x P(x) = 1. We refer to this property as being normalized. Without this
property, we could obtain probabilities greater than one by computing the
probability of one of many events occurring.

For example, consider a single discrete random variable x with k different
states. We can place a uniform distribution on x—that is, make each of its
states equally likely—by setting its probability mass function to
P(x = xi) = 1/k
for all i.
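The three PMF properties can be checked directly; a minimal sketch in pure Python using a uniform distribution with k = 6 states (a fair die, an arbitrary example), with `fractions.Fraction` to keep the arithmetic exact:

```python
from fractions import Fraction

k = 6                                            # number of states, e.g. a fair die
pmf = {x: Fraction(1, k) for x in range(1, k + 1)}

# Property 1: the domain covers all possible states (keys 1..6 here)
# Property 2: every probability lies in [0, 1]
assert all(0 <= p <= 1 for p in pmf.values())
# Property 3: the probabilities sum to exactly 1 (normalized)
assert sum(pmf.values()) == 1

print(pmf[3])   # 1/6
```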

1.12. MARGINAL PROBABILITY

Sometimes we know the probability distribution over a set of variables and we


want to know the probability distribution over just a subset of them. The probability
distribution over the subset is known as the marginal probability distribution.

For example, suppose we have discrete random variables x and y, and we know

P (x, y). We can find P (x) with the sum rule:

∀x ∈ x, P (x = x) = ∑y P (x = x, y = y).

The name “marginal probability” comes from the process of computing marginal
probabilities on paper. When the values of P (x, y) are written in a grid with different
values of x in rows and different values of y in columns, it is natural to sum across a
row of the grid, then write P(x) in the margin of the paper just to the right of the row.

For continuous variables, we need to use integration instead of summation:

p(x) = ∫ p(x, y) dy
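The sum rule can be sketched numerically; the joint table below is a made-up example, not from the notes:

```python
# Sketch: marginalizing a joint distribution stored as a table.
P_xy = [
    [0.10, 0.20],   # x = 0 row: P(x=0, y=0), P(x=0, y=1)
    [0.30, 0.40],   # x = 1 row
]
# Sum rule: P(x) = sum_y P(x, y), i.e. sum across each row
P_x = [sum(row) for row in P_xy]

assert abs(P_x[0] - 0.3) < 1e-12 and abs(P_x[1] - 0.7) < 1e-12
assert abs(sum(P_x) - 1.0) < 1e-12   # the marginal is itself normalized
```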

1.13. CONDITIONAL PROBABILITY

In many cases, we are interested in the probability of some event, given that some
other event has happened. This is called a conditional probability. We denote the
conditional probability that y = y given x = x as P(y = y | x = x).

This conditional probability can be computed with the formula

P(y = y | x = x) = P(y = y, x = x) / P(x = x).
The conditional probability is only defined when P(x = x) > 0. We cannot compute
the conditional probability conditioned on an event that never happens.
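A minimal sketch of this formula, using an illustrative joint table (the numbers are made up):

```python
# Sketch: conditional probability from a joint table.
P_xy = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}

def P_y_given_x(y, x):
    # P(y | x) = P(x, y) / P(x), defined only when P(x) > 0
    P_x = sum(p for (xv, _), p in P_xy.items() if xv == x)
    return P_xy[(x, y)] / P_x

# P(y=1 | x=1) = 0.4 / 0.7
assert abs(P_y_given_x(1, 1) - 0.4 / 0.7) < 1e-12
```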

It is important not to confuse conditional probability with computing what would


happen if some action were undertaken. The conditional probability that a person
is from Germany given that they speak German is quite high, but if a randomly
selected person is taught to speak German, their country of origin does not
change. Computing the consequences of an action is called making an
intervention query.

The Chain Rule of Conditional Probabilities

Any joint probability distribution over many random variables may be decomposed
into conditional distributions over only one variable:

P(x(1), . . . , x(n)) = P(x(1)) ∏(i=2 to n) P(x(i) | x(1), . . . , x(i−1)).

This observation is known as the chain rule or product rule of probability. It follows
immediately from the definition of conditional probability given above.

For example, applying the definition twice, we get

P(a, b, c) = P(a | b, c) P(b, c)
P(b, c) = P(b | c) P(c)
P(a, b, c) = P(a | b, c) P(b | c) P(c).
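The chain-rule decomposition can be verified numerically; the joint distribution below is randomly generated purely for illustration:

```python
# Sketch verifying the chain rule P(a,b,c) = P(a|b,c) P(b|c) P(c)
# on an arbitrary made-up joint over three binary variables.
import itertools
import random

random.seed(0)
raw = {abc: random.random() for abc in itertools.product([0, 1], repeat=3)}
Z = sum(raw.values())
P = {abc: v / Z for abc, v in raw.items()}   # normalized joint P(a, b, c)

def marg(**fixed):
    # Marginal probability of the partial assignment in `fixed`
    return sum(p for (a, b, c), p in P.items()
               if all({'a': a, 'b': b, 'c': c}[k] == v for k, v in fixed.items()))

for (a, b, c), p in P.items():
    chain = (marg(a=a, b=b, c=c) / marg(b=b, c=c)) \
          * (marg(b=b, c=c) / marg(c=c)) * marg(c=c)
    assert abs(p - chain) < 1e-12
```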

Independence and Conditional Independence

Two random variables x and y are independent if their probability distribution can
be expressed as a product of two factors, one involving only x and one involving
only y:

∀x ∈ x, y ∈ y, p(x = x, y = y) = p(x = x) p(y = y).

Two random variables x and y are conditionally independent given a random


variable z if the conditional probability distribution over x and y factorizes in this way
for every value of z:

∀x ∈ x, y ∈ y, z ∈ z, p(x = x, y = y | z = z) = p(x = x | z = z)p(y = y | z = z)

We can denote independence and conditional independence with compact


notation: x⊥y means that x and y are independent, while x⊥y | z means that x and
y are conditionally independent given z.

1.14. EXPECTATIONS, VARIANCE AND COVARIANCE

EXPECTATIONS

The expectation or expected value of some function f(x) with respect to a


probability distribution P (x) is the average or mean value that f takes on when x is
drawn from P . For discrete variables this can be computed with a summation:


Ex∼P [f(x)] = ∑x P(x) f(x),

while for continuous variables, it is computed with an integral:

Ex∼p[f(x)] = ∫ p(x) f(x) dx.

When the identity of the distribution is clear from the context, we may simply write
the name of the random variable that the expectation is over, as in Ex[f(x)]. If it is
clear which random variable the expectation is over, we may omit the subscript
entirely, as in E[f (x)]. By default, we can assume that E[·] averages over the values
of all the random variables inside the brackets. Likewise, when there is no
ambiguity, we may omit the square brackets.

Expectations are linear, for example,

Ex[αf(x) + βg(x)] = αEx[f(x)] + βEx[g(x)],

when α and β are not dependent on x.
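A quick numerical check of this linearity property, with an illustrative distribution and arbitrary functions f and g:

```python
# Sketch checking linearity of expectation, E[αf + βg] = αE[f] + βE[g],
# for a small discrete distribution (all numbers are illustrative).
P = {0: 0.2, 1: 0.5, 2: 0.3}
f = lambda x: x * x
g = lambda x: 3 * x + 1
alpha, beta = 2.0, -1.5

E = lambda h: sum(P[x] * h(x) for x in P)   # E[h(x)] = sum_x P(x) h(x)
lhs = E(lambda x: alpha * f(x) + beta * g(x))
rhs = alpha * E(f) + beta * E(g)
assert abs(lhs - rhs) < 1e-12
```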

VARIANCE

The variance gives a measure of how much the values of a function of a random
variable x vary as we sample different values of x from its probability distribution:

Var(f(x)) = E[(f(x) − E[f(x)])²].

When the variance is low, the values of f(x) cluster near their expected value. The
square root of the variance is known as the standard deviation.

COVARIANCE

The covariance gives some sense of how much two values are linearly related to
each other, as well as the scale of these variables:

Cov(f(x), g(y)) = E [(f(x) − E [f(x)]) (g(y) − E [g(y)])]

High absolute values of the covariance mean that the values change very much
and are both far from their respective means at the same time. If the sign of the
covariance is positive, then both variables tend to take on relatively high values
simultaneously. If the sign of the covariance is negative, then one variable tends to
take on a relatively high value at the times that the other takes on a relatively low
value and vice versa. Other measures such as correlation normalize the
contribution of each variable in order to measure only how much the variables are
related, rather than also being affected by the scale of the separate variables.

The notions of covariance and dependence are related, but are in fact distinct
concepts. They are related because two variables that are independent have zero
covariance, and two variables that have non-zero covariance are dependent.
However, independence is a distinct property from zero covariance. For two
variables to have zero covariance, there must be no linear dependence between
them. Independence is a stronger requirement than zero covariance, because
independence also excludes nonlinear relationships. It is possible for two variables
to be dependent but have zero covariance. For example, suppose we first sample
a real number x from a uniform distribution over the interval [−1, 1]. We next sample
a random variable s. With probability 1/2, we choose the value of s to be 1;
otherwise, we choose the value of s to be −1. We can then generate a random
variable y by assigning y = sx. Clearly, x and y are not independent, because x
completely determines the magnitude of y. However, Cov(x, y) = 0.
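This example can be simulated; the sample covariance below should come out near zero (sample size and seed are arbitrary choices):

```python
# Sketch of the example above: x ~ Uniform[-1, 1], s = +/-1 with probability 1/2,
# y = s * x. The sample covariance is near zero even though y's magnitude
# is fully determined by x.
import random

random.seed(0)
n = 200_000
xs, ys = [], []
for _ in range(n):
    x = random.uniform(-1, 1)
    s = random.choice([-1, 1])
    xs.append(x)
    ys.append(s * x)

mx = sum(xs) / n
my = sum(ys) / n
cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
assert abs(cov) < 0.01   # close to the true value Cov(x, y) = 0
```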

The covariance matrix of a random vector x ∈ Rn is an n × n matrix, such that

Cov(x)i,j = Cov(xi, xj).

The diagonal elements of the covariance give the variance:

Cov(xi, xi) = Var(xi ).

1.15. BAYES’ RULE

We often find ourselves in a situation where we know P (y | x) and need to know
P (x | y). Fortunately, if we also know P (x), we can compute the desired quantity
using Bayes’ rule:

P(x | y) = P(x) P(y | x) / P(y).
Note that while P (y) appears in the formula, it is usually feasible to compute

P(y) = ∑x P(y | x) P(x), so we do not need to begin with knowledge of P (y).

Bayes’ rule is straightforward to derive from the definition of conditional probability,


but it is useful to know the name of this formula since many texts refer to it by name.
It is named after the Reverend Thomas Bayes, who first discovered a special case of
the formula. The general version presented here was independently discovered by
Pierre-Simon Laplace.
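A sketch of Bayes’ rule on a made-up binary diagnostic-test example (all of these probabilities are illustrative, not from the notes):

```python
# Sketch: Bayes' rule for a binary example with made-up numbers.
P_x = 0.01             # prior P(x = disease)
P_y_given_x = 0.95     # P(y = positive | disease)
P_y_given_notx = 0.05  # P(y = positive | no disease)

# Compute the evidence P(y) by marginalizing, as the text describes
P_y = P_y_given_x * P_x + P_y_given_notx * (1 - P_x)
# Bayes' rule: P(x | y) = P(x) P(y | x) / P(y)
posterior = P_x * P_y_given_x / P_y
assert 0.16 < posterior < 0.17   # roughly 0.161: still far from certain
```

Note how the small prior keeps the posterior low even though the test is accurate; this is exactly the kind of update Bayes’ rule formalizes.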

1.16. INFORMATION THEORY

Information theory is a branch of applied mathematics that revolves around


quantifying how much information is present in a signal. It was originally invented to
study sending messages from discrete alphabets over a noisy channel, such as
communication via radio transmission. In this context, information theory tells how to
design optimal codes and calculate the expected length of messages sampled
from specific probability distributions using various encoding schemes. In the
context of machine learning, we can also apply information theory to continuous
variables where some of these message length interpretations do not apply. This
field is fundamental to many areas of electrical engineering and computer science.

We would like to quantify information in a way that formalizes this intuition.

Specifically,

 Likely events should have low information content, and in the extreme case,
events that are guaranteed to happen should have no information content
whatsoever.
 Less likely events should have higher information content.
 Independent events should have additive information. For example, finding
out that a tossed coin has come up as heads twice should convey twice as
much information as finding out that a tossed coin has come up as heads
once.

In order to satisfy all three of these properties, we define the self-information

of an event x = x to be

I(x) = − log P (x).

We always use log to mean the natural logarithm, with base e. Our definition of I(x)
is therefore written in units of nats.

Self-information deals only with a single outcome. We can quantify the amount of
uncertainty in an entire probability distribution using the Shannon entropy:

H(x) = Ex∼P [I(x)] = −Ex∼P [log P (x)].

also denoted H(P ). In other words, the Shannon entropy of a distribution is the
expected amount of information in an event drawn from that distribution. It gives a
lower bound on the number of bits (if the logarithm is base 2, otherwise the units are
different) needed on average to encode symbols drawn from a distribution P.
Distributions that are nearly deterministic (where the outcome is nearly certain)
have low entropy; distributions that are closer to uniform have high entropy. When x
is continuous, the Shannon entropy is known as the differential entropy.
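A small sketch computing the Shannon entropy (in nats) of a binary variable, matching the behavior described above:

```python
# Sketch: self-information and Shannon entropy (in nats) of a Bernoulli(p)
# variable: H = -(1 - p) log(1 - p) - p log p.
import math

def entropy(p):
    H = 0.0
    for q in (p, 1 - p):
        if q > 0:
            H -= q * math.log(q)   # -P(x) log P(x), summed over states
    return H

assert entropy(0.0) == 0.0                       # deterministic: no uncertainty
assert abs(entropy(0.5) - math.log(2)) < 1e-12   # maximal for a fair coin
assert entropy(0.9) < entropy(0.5)               # closer to uniform, higher entropy
```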


If we have two separate probability distributions P (x) and Q (x) over the same
random variable x, we can measure how different these two distributions are using
the Kullback-Leibler (KL) divergence:

DKL(P ‖ Q) = Ex∼P [log P(x)/Q(x)] = Ex∼P [log P(x) − log Q(x)].
Figure : Shannon entropy (in nats) of a binary random variable. This plot shows how
distributions that are closer to deterministic have low Shannon entropy while
distributions that are close to uniform have high Shannon entropy. On the horizontal
axis, we plot p, the probability of a binary random variable being equal to 1. The
entropy is given by (p − 1) log(1 − p) − p log p. When p is near 0, the distribution is
nearly deterministic, because the random variable is nearly always 0. When p is
near 1, the distribution is nearly deterministic, because the random variable is
nearly always 1. When p = 0.5, the entropy is maximal, because the distribution is
uniform over the two outcomes.

A quantity that is closely related to the KL divergence is the cross-entropy
H(P, Q) = H(P) + DKL(P ‖ Q), which is similar to the KL divergence but lacks the
term on the left:

H(P, Q) = −Ex∼P log Q(x).
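A sketch checking the identity H(P, Q) = H(P) + DKL(P ‖ Q) on two illustrative discrete distributions:

```python
# Sketch: KL divergence and cross-entropy for two made-up discrete distributions.
import math

P = [0.5, 0.3, 0.2]
Q = [0.4, 0.4, 0.2]

kl = sum(p * math.log(p / q) for p, q in zip(P, Q))    # D_KL(P || Q)
H_P = -sum(p * math.log(p) for p in P)                 # entropy H(P)
cross = -sum(p * math.log(q) for p, q in zip(P, Q))    # cross-entropy H(P, Q)

assert kl >= 0                           # KL divergence is non-negative
assert abs(cross - (H_P + kl)) < 1e-12   # H(P, Q) = H(P) + D_KL(P || Q)
```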

1.17. NUMERICAL COMPUTATION

Machine learning algorithms usually require a high amount of numerical
computation. This typically refers to algorithms that solve mathematical problems
by methods that update estimates of the solution via an iterative process, rather
than analytically deriving a formula providing a symbolic expression for the correct
solution. Common operations include optimization (finding the value of an
argument that minimizes or maximizes a function) and solving systems of linear
equations. Even just evaluating a mathematical function on a digital computer can
be difficult when the function involves real numbers, which cannot be represented
precisely using a finite amount of memory.


Overflow and Underflow

The fundamental difficulty in performing continuous math on a digital computer is


that we need to represent infinitely many real numbers with a finite number of bit
patterns. This means that for almost all real numbers, we incur some approximation
error when we represent the number in the computer. In many cases, this is just
rounding error. Rounding error is problematic, especially when it compounds
across many operations, and can cause algorithms that work in theory to fail in
practice if they are not designed to minimize the accumulation of rounding error.

Underflow

One form of rounding error that is particularly devastating is underflow. Underflow


occurs when numbers near zero are rounded to zero. Many functions behave
qualitatively differently when their argument is zero rather than a small positive
number.

For example, we usually want to avoid division by zero (some software


environments will raise exceptions when this occurs, others will return a result with a
placeholder not-a-number value) or taking the logarithm of zero (this is usually
treated as −∞, which then becomes not-a-number if it is used for many further
arithmetic operations).
Overflow

Another highly damaging form of numerical error is overflow. Overflow occurs


when numbers with large magnitude are approximated as ∞ or −∞. Further
arithmetic will usually change these infinite values into not-a-number values.

One example of a function that must be stabilized against underflow and overflow
is the softmax function. The softmax function is often used to predict the
probabilities associated with a multinoulli distribution.
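A common stabilization, sketched below, subtracts max(z) before exponentiating; this leaves the softmax output unchanged mathematically but avoids overflow (the input values are illustrative):

```python
# Sketch: a numerically stabilized softmax. Subtracting max(z) leaves the
# result unchanged mathematically but prevents overflow in exp.
import math

def softmax(z):
    m = max(z)                          # shift so the largest exponent is 0
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

# A naive exp(1000) would overflow to inf; the stabilized version is fine.
out = softmax([1000.0, 1000.0, 0.0])
assert abs(sum(out) - 1.0) < 1e-12
assert abs(out[0] - 0.5) < 1e-9   # exp(0) / (exp(0) + exp(0) + exp(-1000))
```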

Poor Conditioning

Conditioning refers to how rapidly a function changes with respect to small


changes in its inputs. Functions that change rapidly when their inputs are
perturbed slightly can be problematic for scientific computation because
rounding errors in the inputs can result in large changes in the output.
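A sketch of poor conditioning using a nearly singular 2×2 system (the matrix and right-hand sides are made-up examples):

```python
# Sketch: an ill-conditioned 2x2 system. A tiny perturbation of b produces
# a large change in the solution of Ax = b.
def solve2(a, b, c, d, e, f):
    # Solve [[a, b], [c, d]] @ [x, y] = [e, f] by Cramer's rule
    det = a * d - b * c
    return ((e * d - b * f) / det, (a * f - e * c) / det)

# Nearly singular matrix: the two rows are almost parallel
x1 = solve2(1.0, 1.0, 1.0, 1.0001, 2.0, 2.0001)   # b = (2, 2.0001)
x2 = solve2(1.0, 1.0, 1.0, 1.0001, 2.0, 2.0002)   # b perturbed by only 1e-4
# The 1e-4 change in the input moved the solution by about 1 in each entry
assert abs(x1[0] - x2[0]) > 0.5 and abs(x1[1] - x2[1]) > 0.5
```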

1.18. GRADIENT-BASED OPTIMIZATION

Most deep learning algorithms involve optimization of some sort. Optimization


refers to the task of either minimizing or maximizing some function f (x) by altering

x. We usually phrase most optimization problems in terms of minimizing f (x).


Maximization may be accomplished via a minimization algorithm by minimizing
−f (x).

The function we want to minimize or maximize is called the objective function or
criterion. When we are minimizing it, we may also call it the cost function, loss
function, or error function. We use these terms interchangeably, though some
machine learning publications assign special meaning to some of these terms.

We often denote the value that minimizes or maximizes a function with a
superscript ∗. For example, we might say x∗ = arg min f(x).

Figure : An illustration of how the gradient descent algorithm uses the derivatives of
a function to follow the function downhill to a minimum.

We assume the reader is already familiar with calculus, but provide a brief review
of how calculus concepts relate to optimization here.
Suppose we have a function y = f(x), where both x and y are real numbers.
The derivative of this function is denoted as f′(x) or as dy/dx. The derivative f′(x)
gives the slope of f(x) at the point x. In other words, it specifies how to
scale a small change in the input in order to obtain the corresponding change
in the output:

f(x + ε) ≈ f(x) + εf′(x).

When f′(x) = 0, the derivative provides no information about which direction
to move. Points where f′(x) = 0 are known as critical points or stationary
points. A local minimum is a point where f(x) is lower than at all neighboring
points, so it is no longer possible to decrease f(x) by making infinitesimal steps.
A local maximum is a point where f(x) is higher than at all neighboring points,
so it is not possible to increase f(x) by making infinitesimal steps. Some critical
points are neither maxima nor minima. These are known as saddle points.
See the following figure

Figure : The three types of critical points: a minimum, a maximum, and a saddle point.

A point that obtains the absolute lowest value of f (x) is a global minimum. It is
possible for there to be only one global minimum or multiple global minima of the
function. It is also possible for there to be local minima that are not globally
optimal. In the context of deep learning, we optimize functions that may have
many local minima that are not optimal, and many saddle points surrounded by
very flat regions. All of this makes optimization very difficult, especially when the
input to the function is multidimensional. We therefore usually settle for finding a
value of f that is very low, but not necessarily minimal in any formal sense. See
figure for an example.

Figure : Optimization algorithms may fail to find a global minimum when there are
multiple local minima or plateaus present. In the context of deep learning, we
generally accept such solutions even though they are not truly minimal, so long
as they correspond to significantly low values of the cost function.


Figure : A saddle point containing both positive and negative curvature.


The function in this example is f(x) = x1² − x2². Along the axis corresponding to x1,

the function curves upward. This axis is an eigenvector of the Hessian and has a
positive eigenvalue. Along the axis corresponding to x2, the function curves
downward. This direction is an eigenvector of the Hessian with negative
eigenvalue. The name “saddle point” derives from the saddle-like shape of this
function. This is the quintessential example of a function with a saddle point.
In more than one dimension, it is not necessary to have an eigenvalue of 0 in
order to get a saddle point: it is only necessary to have both positive and
negative eigenvalues. We can think of a saddle point with both signs of
eigenvalues as being a local maximum within one cross section and a local
minimum within another cross section.

1.19. CONSTRAINED OPTIMIZATION

Sometimes we wish not only to maximize or minimize a function f(x) over all
possible values of x. Instead we may wish to find the maximal or minimal value of f
(x) for values of x in some set S. This is known as constrained optimization. Points x
that lie within the set S are called feasible points in constrained optimization
terminology.
We often wish to find a solution that is small in some sense. A common approach
in such situations is to impose a norm constraint, such as ||x|| ≤ 1.
The Karush–Kuhn–Tucker (KKT) approach provides a very general solution to
constrained optimization. With the KKT approach, we introduce a new function
called the generalized Lagrangian or generalized Lagrange function.

To define the Lagrangian, we first need to describe S in terms of equations and
inequalities. We want a description of S in terms of m functions g(i) and n functions
h(j) so that S = {x | ∀i, g(i)(x) = 0 and ∀j, h(j)(x) ≤ 0}. The equations involving
g(i) are called the equality constraints, and the inequalities involving h(j) are called
the inequality constraints.

We introduce new variables λi and α j for each constraint, these are called the KKT
multipliers. The generalized Lagrangian is then defined as


L(x, λ, α) = f(x) + ∑i λi g(i)(x) + ∑j αj h(j)(x).

We can now solve a constrained minimization problem using unconstrained
optimization of the generalized Lagrangian. Observe that, so long as at least
one feasible point exists and f(x) is not permitted to have value ∞, then

min_x max_λ max_{α, α≥0} L(x, λ, α)

has the same optimal objective function value and set of optimal points x as

min_{x∈S} f(x).

This follows because any time the constraints are satisfied,

max_λ max_{α, α≥0} L(x, λ, α) = f(x),

while any time a constraint is violated,

max_λ max_{α, α≥0} L(x, λ, α) = ∞.
To perform constrained maximization, we can construct the generalized Lagrange
function of −f(x), which leads to this optimization problem:

min_x max_λ max_{α, α≥0} −f(x) + ∑i λi g(i)(x) + ∑j αj h(j)(x).

A simple set of properties describe the optimal points of constrained optimization
problems. These properties are called the Karush-Kuhn-Tucker (KKT) conditions.
They are necessary conditions, but not always sufficient conditions, for a point to be
optimal. The conditions are:
 The gradient of the generalized Lagrangian is zero.
 All constraints on both x and the KKT multipliers are satisfied.
 The inequality constraints exhibit “complementary slackness”: α ⊙ h(x) = 0.
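These conditions can be checked on a tiny illustrative problem, minimize f(x) = x² subject to h(x) = 1 − x ≤ 0, whose constrained optimum is x* = 1 with multiplier α* = 2 (the problem itself is an assumption chosen for illustration):

```python
# Sketch: verifying the KKT conditions at the optimum of a toy problem:
# minimize f(x) = x^2 subject to h(x) = 1 - x <= 0.
x_star, alpha_star = 1.0, 2.0

grad_f = 2 * x_star    # f'(x) = 2x
grad_h = -1.0          # h'(x) = -1
h = 1 - x_star         # constraint value at the optimum (active, so h = 0)

# 1. Gradient of the Lagrangian L = f + alpha * h is zero
assert grad_f + alpha_star * grad_h == 0.0
# 2. The constraint and the multiplier-sign condition are satisfied
assert h <= 0 and alpha_star >= 0
# 3. Complementary slackness: alpha * h(x) = 0
assert alpha_star * h == 0.0
```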

1.20. LINEAR LEAST SQUARES

Suppose we want to find the value of x that minimizes

f(x) = (1/2) ||Ax − b||₂².


There are specialized linear algebra algorithms that can solve this problem
efficiently. However, we can also explore how to solve it using gradient-based
optimization as a simple example of how these techniques work.
First, we need to obtain the gradient:

∇x f(x) = Aᵀ(Ax − b) = AᵀAx − Aᵀb.

Algorithm
An algorithm to minimize f(x) = (1/2)||Ax − b||₂² with respect to x
using gradient descent, starting from an arbitrary value of x:
Set the step size (ε) and tolerance (δ) to small, positive numbers.
while ||AᵀAx − Aᵀb||₂ > δ do
x ← x − ε (AᵀAx − Aᵀb)
end while
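The algorithm above can be sketched in Python; the matrix A, vector b, step size, and tolerance below are arbitrary illustrative choices:

```python
# Sketch of the algorithm above: gradient descent on f(x) = (1/2)||Ax - b||^2
# for a small made-up system.
import numpy as np

A = np.array([[2.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
b = np.array([2.0, 1.0, 2.0])
x = np.zeros(2)            # arbitrary starting point
eps, delta = 0.1, 1e-8     # step size and tolerance

grad = A.T @ A @ x - A.T @ b
while np.linalg.norm(grad) > delta:
    x = x - eps * grad     # x <- x - eps * (A^T A x - A^T b)
    grad = A.T @ A @ x - A.T @ b

# At convergence x satisfies the normal equations A^T A x = A^T b
assert np.allclose(A.T @ A @ x, A.T @ b, atol=1e-6)
```

For this particular A and b the iterates converge to x = (1, 1), the unique least-squares solution.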

9. Practice Quiz
1. The inverse of the matrix is possible only for
a) Non-singular matrix
b) Singular matrix
c) Zero Matrix
d) Symmetric Matrix
2. Suppose that the price of 2 balls and 1 bat is 100 units. What will be the
representation of this problem in linear algebra, in terms of x and y?
a) 2x + y = 100
b) 2x + 2y = 100
c) 2x + y = 200
d) x + y = 100
3. What is the first step in linear algebra?
A. Let's complicate the problem
B. Solve the problem
C. Visualise the problem
D. None Of the above
4. Which of the following is not a type of matrix?
A. Square Matrix
B. Scalar Matrix
C. Trace Matrix
D. Term Matrix
5. The sum of all the diagonal elements of a square matrix is called?
a) Diagonal matrix
b) Trace matrix
c) Both
d) Identity matrix
6 Multiplication of a matrix with a scalar constant is called?
A.Complex multiplication
B. Linear multiplication
C. Scalar multiplication
D. Constant multiplication
7. Which of the following is correct method to solve matrix equations?
A. Row Echelon Form
B. Inverse of a Matrix
C. Both A and B
D. None Of the above
8. ______________ is equal to the maximum number of linearly independent
row vectors in a matrix.
A. Row matrix
B. Rank of a matrix
C. Term matrix
D. Linear matrix
9. Vectors whose direction remains unchanged even after applying linear
transformation with the matrix are called?
A. Eigenvalues
B. Eigenvectors
C. Cofactor matrix
D. Minor of a matrix
10. The concept of Eigen values and vectors is applicable to?
A. Scalar matrix
B. Identity matrix
C. Upper triangular matrix
D. Square matrix
11. Singular matrices are?

A. non-invertible
B. invertible
C. Both non-invertible and invertible
D. None Of the above

12. Singular Value Decomposition is some sort of generalisation of __________


decomposition.

A. Singular
B. Eigen vector
C. Eigen value
D. None Of the above


10. Assignments

S.No Question BL CO
Discuss the five types of matrices.
1 6 2
Write and explain in detail about principal component analysis
2 5 1
with suitable examples.
3 Explain in detail about Bayes’ rule 5 1
Compare expectation, variance and covariance
4 2 2
Discuss in detail about the gradient based optimization and
5 constrained based optimization. 6 2

11. Part A- Question & Answers

S.No Question& Answers BL CO


1 What is principal component analysis?
Ans. Principal component analysis, or PCA, is a statistical
procedure that allows you to summarize the information
1 1
content in large data tables by means of a smaller set of
“summary indices” that can be more easily visualized and
analyzed.
2 What is the difference between eigenvalues and eigen vectors ?
Ans: Eigenvectors are the directions along which a
particular linear transformation acts by flipping,
compressing or stretching. Eigenvalue can be referred to 1 1
as the strength of the transformation in the direction of
eigenvector or the factor by which the compression
occurs
3 Why is Bayes’ rule used?
Ans. Bayes' Rule lets you calculate the posterior (or "updated")
probability. This is a conditional probability. It is the
1 1
probability of the hypothesis being true, if the evidence is
present. Think of the prior (or "previous") probability as your
belief in the hypothesis before seeing the new evidence
4 What are the types of least square method?
Ans. Generally speaking, Least-Squares Method has two
categories, linear and non-linear. We can also classify these
methods further: ordinary least squares (OLS), 1 1
weighted least squares (WLS), and
alternating least squares (ALS) and
partial least squares (PLS).
5 What is least square method formula? 1 1
Ans. When we have to determine the equation of the line of best fit for the given
data, we use the following formulas.
The equation of least square line is given by Y = a + bX.
Normal equation for 'a':
∑Y = na + b∑X.
Normal equation for 'b':
∑XY = a∑X + b∑X²

6 What is overflow and underflow?


Ans. Overflow and Underflow. Simply put, overflow and
underflow happen when we assign a value that is out of
1 1
range of the declared data type of the variable. If the
(absolute) value is too big, we call it overflow, if the value is
too small, we call it underflow.

12. Part B- Questions

S.No Question BL CO
1 Write and explain in detail about principal component analysis 2 1
with suitable examples.
2 Explain in detail about Bayes’ rule 2 1
3 Compare expectation, variance and covariance 2 4

4 Discuss in detail about the gradient based optimization and 6 2


constrained based optimization.

13. Supportive Online Certification Courses


1. CS231n: Convolutional Neural Networks for Visual Recognition, Stanford
2. CS224d: Deep Learning for Natural Language Processing, Stanford
3. CS285: Deep Reinforcement Learning, Berkeley
4. MIT 6.S094: Deep Learning for Self-Driving Cars, MIT
5. Neural networks and Deep learning By Andrew Ng, conducted by Coursera –
4weeks
6. Tensorflow for deep learning By Dr Kevin Webster, conducted by coursera – 6
months
7. Deep learning NPTEL course By Prof. Sudarshan Iyengar, Prof. Sanatan Sukhija-IIT
Ropar-10 weeks

14. Real Time Applications

S.No Application CO

1 Virtual Assistants 1
Virtual Assistants are cloud-based applications that understand natural
language voice commands and complete tasks for the user. Amazon
Alexa, Cortana, Siri, and Google Assistant are typical examples of virtual
assistants. They need internet-connected devices to work with their full
capabilities. Each time a command is fed to the assistant, they tend to
provide a better user experience based on past experiences using Deep
Learning algorithms.
2 Chatbots 1
Chatbots can solve customer problems in seconds. A chatbot is an AI
application to chat online via text or text-to-speech. It is capable of
communicating and performing actions similar to a human. Chatbots are
used a lot in customer interaction, marketing on social network sites, and
instant messaging the client. It delivers automated responses to user
inputs. It uses machine learning and deep learning algorithms to
generate different types of reactions.
The next important deep learning application is related to Healthcare.
3 Healthcare 1
Deep Learning has found its application in the Healthcare sector.
Computer-aided disease detection and computer-aided diagnosis have
been possible using Deep Learning. It is widely used for medical research,
drug discovery, and diagnosis of life-threatening diseases such as cancer
and diabetic retinopathy through the process of medical imaging.
4 Entertainment 1
Companies such as Netflix, Amazon, YouTube, and Spotify give relevant
movies, songs, and video recommendations to enhance their customer
experience. This is all thanks to Deep Learning. Based on a person’s
browsing history, interest, and behavior, online streaming companies give
suggestions to help them make product and service choices. Deep
learning techniques are also used to add sound to silent movies and
generate subtitles automatically.
5 News Aggregation and Fake News Detection 1
Deep Learning allows you to customize news depending on the readers’
persona. You can aggregate and filter out news information as per
social, geographical, and economic parameters and the individual
preferences of a reader. Neural Networks help develop classifiers that
can detect fake and biased news and remove it from your feed. They
also warn you of possible privacy breaches.
6 Image Coloring 1
Image colorization has seen significant advancements using Deep
Learning. Image colorization is taking an input of a grayscale image and
then producing an output of a colorized image. ChromaGAN is an
example of a picture colorization model. A generative network is framed
in an adversarial model that learns to colorize by incorporating a
perceptual and semantic understanding of both class distributions and
color.

15. Contents Beyond the Syllabus

1. Building Generative Adversarial Networks


Become familiar with generative adversarial networks (GANs) by learning how to
build and train different GANs architectures to generate new images. Discover,
build, and train architectures such as DCGAN, CycleGAN, ProGAN, and StyleGAN
on diverse datasets including the MNIST dataset, Summer2Winter Yosemite
dataset, or CelebA dataset.

16. Prescribed Text Books & Reference Books


Text Book
1. Ian Goodfellow, Yoshua bengio, Aaron Courville, “Deep learning”, MIT Press, 2016.
2. Josh Patterson and Adam Gibson, “Deep learning: A practitioner’s approach”,
O’Reilly Media, first edition, 2017.
References:
1. “Fundamentals of Deep learning, Designing next generation machine intelligence
algorithms”, Nikhil Buduma, o’Reilly, Shroff Publishers,2019.
2. “Deep Learning Cook Book”, Practical recipes to get started Quickly,
DouweOsinga, O’Reilly, Shroff Publishers,2019

17. Mini Project Suggestion

1. Predict Next Sequence

To start with deep learning, the very basic project that you can build is to predict
the next digit in a sequence. Create a sequence like a list of odd numbers and then
build a model and train it to predict the next digit in the sequence. A simple neural
network with 2 layers would be sufficient to build the model.

2. Human Face Detection

Face detection took a major leap with deep learning techniques. We can
build models with high accuracy in detecting the bounding boxes of the human
face. This project will get you started with object detection and you will learn how
to detect any object in an image.

3. Dog’s Breed Identification

How often do you get stuck thinking about the name of a dog’s breed? There are
many dog breeds and most of them are similar to each other. We can use the dog
breeds dataset and build a model that will classify different dog breeds from an
image. This project will be useful for a lot of people.
