Linear Algebra and Learning from Data
Multiplication Ax and AB
Column space of A
Independent rows and basis
Row rank = column rank
Neural Networks and Deep Learning / new course and book
math.mit.edu/learningfromdata
By rows:
$$\begin{bmatrix} 2 & 3 \\ 2 & 4 \\ 3 & 7 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 2x_1 + 3x_2 \\ 2x_1 + 4x_2 \\ 3x_1 + 7x_2 \end{bmatrix}$$

By columns:
$$\begin{bmatrix} 2 & 3 \\ 2 & 4 \\ 3 & 7 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = x_1 \begin{bmatrix} 2 \\ 2 \\ 3 \end{bmatrix} + x_2 \begin{bmatrix} 3 \\ 4 \\ 7 \end{bmatrix}$$
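A quick NumPy check of the two viewpoints; the numeric values chosen for x1 and x2 are illustrative only, not from the slides:

```python
import numpy as np

A = np.array([[2, 3],
              [2, 4],
              [3, 7]])
x = np.array([1.0, 2.0])   # illustrative values for x1, x2

# By rows: each entry of Ax is a dot product (row of A) . x
by_rows = np.array([A[i, :] @ x for i in range(3)])

# By columns: Ax is a combination of the columns of A
by_columns = x[0] * A[:, 0] + x[1] * A[:, 1]

print(by_rows)      # [ 8. 10. 17.]
print(by_columns)   # the same vector
```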
b = (b1, b2, b3) is in the column space of A exactly when Ax = b has a solution (x1, x2).
$$b = \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} \text{ is not in } C(A). \qquad Ax = \begin{bmatrix} 2x_1 + 3x_2 \\ 2x_1 + 4x_2 \\ 3x_1 + 7x_2 \end{bmatrix} = \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} \text{ is unsolvable.}$$

The first two equations force $x_1 = \tfrac{1}{2}$ and $x_2 = 0$. Then $3\left(\tfrac{1}{2}\right) + 7(0) = 1.5$ (not 1).
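To confirm numerically that b = (1, 1, 1) is not in C(A), one option is a least-squares solve; the nonzero residual means Ax = b has no exact solution. A sketch:

```python
import numpy as np

A = np.array([[2, 3],
              [2, 4],
              [3, 7]])
b = np.array([1.0, 1.0, 1.0])

# Least-squares solution: the closest Ax can get to b
x_hat, residual, rank, _ = np.linalg.lstsq(A, b, rcond=None)

print(x_hat)      # best x in the least-squares sense
print(residual)   # nonzero sum of squared errors, so b is not in C(A)
```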
What are the column spaces of
$$A_2 = \begin{bmatrix} 2 & 3 & 5 \\ 2 & 4 & 6 \\ 3 & 7 & 10 \end{bmatrix} \quad \text{and} \quad A_3 = \begin{bmatrix} 2 & 3 & 1 \\ 2 & 4 & 1 \\ 3 & 7 & 1 \end{bmatrix} ?$$
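Computing ranks answers the question: column 3 of A2 equals column 1 plus column 2, so C(A2) is the same plane as C(A), while the three columns of A3 are independent and C(A3) is all of R^3. A quick NumPy check (a sketch, not from the slides):

```python
import numpy as np

A2 = np.array([[2, 3, 5],
               [2, 4, 6],
               [3, 7, 10]])
A3 = np.array([[2, 3, 1],
               [2, 4, 1],
               [3, 7, 1]])

print(np.linalg.matrix_rank(A2))   # 2: column 3 = column 1 + column 2
print(np.linalg.matrix_rank(A3))   # 3: the three columns are independent
```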
If column 1 of A is not all zero, put it into C.
If column 2 of A is not a multiple of column 1, put it into C.
If column 3 of A is not a combination of columns 1 and 2, put it
into C. Continue.
At the end C will have r columns (r ≤ n).
They will be a “basis” for the column space of A.
The left-out columns are combinations of those basic columns in C.
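A minimal sketch of this greedy column selection, using NumPy rank checks to decide whether a new column is a combination of the columns already kept; the function name pick_basis_columns is mine, not from the text:

```python
import numpy as np

def pick_basis_columns(A, tol=1e-10):
    """Greedily keep each column of A that is independent of the kept columns."""
    kept = []
    for j in range(A.shape[1]):
        candidate = kept + [A[:, j]]
        # Keep column j only if it raises the rank of the kept set
        if np.linalg.matrix_rank(np.column_stack(candidate), tol=tol) > len(kept):
            kept.append(A[:, j])
    return np.column_stack(kept) if kept else np.zeros((A.shape[0], 0))

A2 = np.array([[2, 3, 5],
               [2, 4, 6],
               [3, 7, 10]])
C = pick_basis_columns(A2)
print(C)   # the first two columns of A2: r = 2
```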
If $A = \begin{bmatrix} 1 & 3 & 8 \\ 1 & 2 & 6 \\ 0 & 1 & 2 \end{bmatrix}$ then $C = \begin{bmatrix} 1 & 3 \\ 1 & 2 \\ 0 & 1 \end{bmatrix}$: n = 3 columns in A, r = 2 columns in C.
If $A = \begin{bmatrix} 1 & 2 & 3 \\ 0 & 4 & 5 \\ 0 & 0 & 6 \end{bmatrix}$ then $C = A$: n = 3 columns in A, r = 3 columns in C.
If $A = \begin{bmatrix} 1 & 2 & 5 \\ 1 & 2 & 5 \\ 1 & 2 & 5 \end{bmatrix}$ then $C = \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}$: n = 3 columns in A, r = 1 column in C.
$$A = \begin{bmatrix} 1 & 3 & 8 \\ 1 & 2 & 6 \\ 0 & 1 & 2 \end{bmatrix} = \begin{bmatrix} 1 & 3 \\ 1 & 2 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 & 2 \\ 0 & 1 & 2 \end{bmatrix} = CR$$
All we are doing is putting the right numbers into R. Combinations of the columns of C produce the columns of A. Then A = CR stores this information as a matrix multiplication.
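Assuming C holds the independent columns, R can be recovered column by column by solving C x = (column of A); a least-squares solve gives exact coefficients here because every column of A lies in C(A) = C(C). A sketch:

```python
import numpy as np

A = np.array([[1, 3, 8],
              [1, 2, 6],
              [0, 1, 2]])
C = np.array([[1, 3],
              [1, 2],
              [0, 1]])

# Each column of R tells how to combine the columns of C to rebuild a column of A
R, *_ = np.linalg.lstsq(C, A, rcond=None)
print(np.round(R, 10))          # [[1, 0, 2], [0, 1, 2]]
print(np.allclose(C @ R, A))    # True: A = CR
```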
The number of independent columns equals the number of
independent rows
Look at A = CR by rows instead of columns. R has r rows. Multiplying by C takes combinations of those rows. Since A = CR, we get every row of A from the r rows of R. Those r rows are independent: a basis for the row space of A. So the row rank of A equals r, the column rank.
Column-row multiplication of matrices

$$AB = \begin{bmatrix} | & & | \\ a_1 & \cdots & a_n \\ | & & | \end{bmatrix} \begin{bmatrix} \text{---}\; b_1^* \;\text{---} \\ \vdots \\ \text{---}\; b_n^* \;\text{---} \end{bmatrix} = \underbrace{a_1 b_1^* + a_2 b_2^* + \cdots + a_n b_n^*}_{\text{sum of rank-1 matrices}}$$

The i, j entry of $a_k b_k^*$ is $a_{ik} b_{kj}$. Add to find $c_{ij} = \sum_{k=1}^{n} a_{ik} b_{kj}$ = (row i of A) · (column j of B).
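A NumPy check that the sum of rank-1 (column times row) pieces equals AB; the sizes and random data here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, p = 4, 3, 5
A = rng.standard_normal((m, n))
B = rng.standard_normal((n, p))

# Sum of n rank-1 matrices: (column k of A) times (row k of B)
outer_sum = sum(np.outer(A[:, k], B[k, :]) for k in range(n))

print(np.allclose(outer_sum, A @ B))   # True
```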
$$A = LU \qquad A = QR \qquad S = Q\Lambda Q^T \qquad A = X\Lambda X^{-1} \qquad A = U\Sigma V^T$$
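These five factorizations are all available in NumPy/SciPy. A quick sketch on small illustrative matrices (note scipy.linalg.lu returns a permutation matrix P with A = PLU):

```python
import numpy as np
from scipy.linalg import lu

A = np.array([[2., 3.], [2., 4.]])
S = np.array([[2., 1.], [1., 2.]])       # symmetric matrix for S = Q Lambda Q^T

P, L, U = lu(A)                          # A = P @ L @ U (P is a permutation)
Q, R = np.linalg.qr(A)                   # A = Q R
lam, Qs = np.linalg.eigh(S)              # S = Qs diag(lam) Qs^T
evals, X = np.linalg.eig(A)              # A = X diag(evals) X^{-1}
U2, sig, Vt = np.linalg.svd(A)           # A = U2 diag(sig) Vt

print(np.allclose(Qs @ np.diag(lam) @ Qs.T, S))   # True
print(np.allclose(U2 @ np.diag(sig) @ Vt, A))     # True
```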
Deep Learning by Neural Networks
1 Key operation: Composition F = F3(F2(F1(x0)))
2 Key rule: Chain rule for derivatives
3 Key algorithm: Stochastic gradient descent
4 Key subroutine: Backpropagation
5 Key nonlinearity: ReLU(x) = max(x, 0) = ramp function
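A minimal sketch of the key operation, composing three layers with ReLU between them; the layer sizes and random weights are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(1)

def relu(x):
    return np.maximum(x, 0)            # ReLU(x) = max(x, 0), the ramp function

# Three affine layers with illustrative shapes: 4 -> 5 -> 5 -> 1
W1, b1 = rng.standard_normal((5, 4)), rng.standard_normal(5)
W2, b2 = rng.standard_normal((5, 5)), rng.standard_normal(5)
W3, b3 = rng.standard_normal((1, 5)), rng.standard_normal(1)

def F1(x): return relu(W1 @ x + b1)
def F2(x): return relu(W2 @ x + b2)
def F3(x): return W3 @ x + b3          # the last layer is left linear here

x0 = rng.standard_normal(4)
F = F3(F2(F1(x0)))                     # the key operation: composition
print(F)
```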
Theorem
Suppose we have N hyperplanes $H_1, \ldots, H_N$ in m-dimensional space $\mathbb{R}^m$. Those come from N linear equations $a_i^T x + b_i = 0$, in other words from Ax + b = 0. Then the number of regions bounded by the N hyperplanes (including infinite regions) is r(N, m) when the hyperplanes are in general position, and certainly not more:

$$r(N, m) = \sum_{i=0}^{m} \binom{N}{i} = \binom{N}{0} + \binom{N}{1} + \cdots + \binom{N}{m}.$$
Thus N = 1 hyperplane in $\mathbb{R}^m$ produces $\binom{1}{0} + \binom{1}{1} = 2$ regions (one fold). And N = 2 hyperplanes will produce 1 + 2 + 1 = 4 regions provided m ≥ 2. When m = 1 we have 2 folds in a line, which only separate the line into r(2, 1) = 3 pieces.
The theorem will follow from the recursive formula
r(N, m) = r(N − 1, m) + r(N − 1, m − 1)
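A short sketch checking the closed form against the recursion for the counts quoted here (r(1, m) = 2, r(2, 2) = 4, r(2, 1) = 3, r(3, 2) = 7); the function names are mine:

```python
from math import comb
from functools import lru_cache

def r_closed(N, m):
    """r(N, m) = sum of binomial(N, i) for i = 0..m."""
    return sum(comb(N, i) for i in range(m + 1))

@lru_cache(maxsize=None)
def r_rec(N, m):
    """Recursion r(N, m) = r(N-1, m) + r(N-1, m-1)."""
    if N == 0:
        return 1     # no hyperplanes: the whole space is one region
    if m == 0:
        return 1     # matches the closed form: r(N, 0) = binomial(N, 0) = 1
    return r_rec(N - 1, m) + r_rec(N - 1, m - 1)

for (N, m) in [(1, 3), (2, 2), (2, 1), (3, 2)]:
    assert r_closed(N, m) == r_rec(N, m)
    print(N, m, r_closed(N, m))    # 2, 4, 3, 7
```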
Figure: The r(2, 1) = 3 pieces of H create 3 new regions. Then the count becomes r(3, 2) = 4 + 3 = 7 flat regions in the continuous piecewise linear surface z = F(x1, x2).
Backpropagation: Reverse Mode Graph for Derivatives of $x^2(x + y)$

Figure: Reverse-mode computation of the gradient $\partial F/\partial x$, $\partial F/\partial y$ at x = 2, y = 3.
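For F = x²(x + y), reverse mode works backward through the computation graph: with s = x + y and F = x²·s, the chain rule gives ∂F/∂x = 2x·s + x² and ∂F/∂y = x². A minimal hand-written sketch at x = 2, y = 3 (no autodiff library, just the forward and reverse passes):

```python
# Forward pass: record intermediate values
x, y = 2.0, 3.0
s = x + y          # s = 5
u = x * x          # u = 4
F = u * s          # F = 20

# Reverse pass: propagate dF/d(node) backward through the graph
dF_dF = 1.0
dF_du = dF_dF * s                    # F = u*s  =>  dF/du = s = 5
dF_ds = dF_dF * u                    # F = u*s  =>  dF/ds = u = 4
dF_dx = dF_du * 2 * x + dF_ds * 1    # u = x^2 and s = x + y both depend on x
dF_dy = dF_ds * 1                    # only s depends on y

print(dF_dx, dF_dy)   # 24.0 4.0
```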
AB first or BC first? Compute (AB)C or A(BC)?

First way: AB = (m × n)(n × p) has mnp multiplications, then (AB)C = (m × p)(p × q) has mpq multiplications.

Second way: BC = (n × p)(p × q) has npq multiplications, then A(BC) = (m × n)(n × q) has mnq multiplications.
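The totals are mnp + mpq = mp(n + q) against npq + mnq = nq(m + p), so the better order depends on all four dimensions; when C is a single column (q = 1), doing BC first is far cheaper. A sketch with illustrative sizes:

```python
def cost_AB_first(m, n, p, q):
    return m * n * p + m * p * q      # AB, then (AB)C

def cost_BC_first(m, n, p, q):
    return n * p * q + m * n * q      # BC, then A(BC)

# Illustrative sizes: C is a single column vector (q = 1)
m, n, p, q = 100, 100, 100, 1
print(cost_AB_first(m, n, p, q))   # 1010000
print(cost_BC_first(m, n, p, q))   # 20000: much cheaper to multiply BC first
```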