
Matrix Approach to Simple Linear Regression

Junhui Wang

Department of Statistics

Fall 2022
Matrix

A matrix is a rectangular array of elements arranged in rows and columns.
The dimension of the matrix is n × p, where n is the number of rows and p is the number of columns.

A matrix with n rows and p columns is usually represented using boldface letters, say A, and can be written as

A = \begin{bmatrix}
a_{11} & a_{12} & \cdots & a_{1j} & \cdots & a_{1p} \\
\vdots & \vdots &        & \vdots &        & \vdots \\
a_{i1} & a_{i2} & \cdots & a_{ij} & \cdots & a_{ip} \\
\vdots & \vdots &        & \vdots &        & \vdots \\
a_{n1} & a_{n2} & \cdots & a_{nj} & \cdots & a_{np}
\end{bmatrix}

A can also be written in a compact form,

A = [a_{ij}]_{i=1,··· ,n; j=1,··· ,p}

where a_{ij} is the element in the i-th row and j-th column.

Two matrices are equal, A = B, iff all of their corresponding elements are equal; i.e., a_{ij} = b_{ij} for all i and j.

When n = p, A is a square matrix; when p = 1, A is a column vector or simply a vector; when n = 1, A is a row vector.
  
column vector: \begin{bmatrix} y_1 \\ y_2 \\ y_3 \end{bmatrix}, \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix}; row vector: [1, x_1]

square matrix: \begin{bmatrix} 3 & \sum_{i=1}^{3} x_i \\ \sum_{i=1}^{3} x_i & \sum_{i=1}^{3} x_i^2 \end{bmatrix}; (design) matrix: \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ 1 & x_3 \end{bmatrix}
Matrix transpose

For A = [a_{ij}], its transpose is A′ = [a_{ji}]_{i=1,··· ,n; j=1,··· ,p}.

column vector: \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}, (transpose) row vector: [y_1 \; y_2 \; \cdots \; y_n].

A is called a symmetric matrix if A = A′, e.g.

symmetric matrix: \begin{bmatrix} n & \sum_{i=1}^{n} x_i \\ \sum_{i=1}^{n} x_i & \sum_{i=1}^{n} x_i^2 \end{bmatrix}
Matrix operations

Element-wise summation and subtraction:

C = A ± B;  c_{ij} = a_{ij} ± b_{ij}

The inner product of x = (x_1, · · · , x_n)′ and y = (y_1, · · · , y_n)′ is

⟨x, y⟩ = \sum_{k=1}^{n} x_k y_k

The product of a scalar λ (a real number) and a matrix A is

λA = [λ a_{ij}]

The product of A and B is determined by the inner products of their rows and columns,

C = AB;  c_{ij} = ⟨(a_{i1}, · · · , a_{in})′, (b_{1j}, · · · , b_{nj})′⟩ = \sum_{k=1}^{n} a_{ik} b_{kj}.

The (i, j)-th element of C is the inner product of the i-th row
of A and the j-th column of B (viewed as vectors in Rn ).
A must have the same number of columns as the number of
rows of B.
Generally, AB ≠ BA.
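As a quick illustration of these operations, here is a minimal NumPy sketch (the matrices and vectors are made up for the example); it shows element-wise addition, the inner product, the scalar product, and that AB and BA generally differ.

import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.array([[0.0, 1.0],
              [5.0, 2.0]])
x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])

print(A + B)                      # element-wise sum: c_ij = a_ij + b_ij
print(np.dot(x, y))               # inner product <x, y> = sum_k x_k y_k = 32
print(2.5 * A)                    # scalar product: every element multiplied by 2.5
print(A @ B)                      # (i, j) entry = inner product of row i of A and column j of B
print(B @ A)                      # generally different from A @ B
print(np.allclose(A @ B, B @ A))  # False for these matrices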

The (i, j)-th element of A′A is the inner product of the i-th and j-th columns of A.
 
Y′Y = [y_1 \; y_2 \; \cdots \; y_n] \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = y_1^2 + y_2^2 + \cdots + y_n^2 = \sum_{i=1}^{n} y_i^2

Xβ = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix} \begin{bmatrix} β_0 \\ β_1 \end{bmatrix} = \begin{bmatrix} β_0 + x_1 β_1 \\ β_0 + x_2 β_1 \\ \vdots \\ β_0 + x_n β_1 \end{bmatrix}

X′X = \begin{bmatrix} 1 & 1 & \cdots & 1 \\ x_1 & x_2 & \cdots & x_n \end{bmatrix} \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix} = \begin{bmatrix} n & \sum_{i=1}^{n} x_i \\ \sum_{i=1}^{n} x_i & \sum_{i=1}^{n} x_i^2 \end{bmatrix}

X′Y = \begin{bmatrix} 1 & 1 & \cdots & 1 \\ x_1 & x_2 & \cdots & x_n \end{bmatrix} \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} \sum_{i=1}^{n} y_i \\ \sum_{i=1}^{n} x_i y_i \end{bmatrix}
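These quantities are easy to compute numerically; here is a small NumPy sketch, with an invented predictor x, response Y, and coefficient vector β used only for illustration.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
Y = np.array([2.1, 3.9, 6.2, 8.1])
n = len(x)

X = np.column_stack([np.ones(n), x])   # design matrix: a column of 1s next to x

print(Y @ Y)          # Y'Y = sum of y_i^2
print(X.T @ X)        # [[n, sum x_i], [sum x_i, sum x_i^2]]
print(X.T @ Y)        # [sum y_i, sum x_i y_i]

beta = np.array([0.5, 1.9])            # an arbitrary coefficient vector
print(X @ beta)       # entries beta_0 + x_i * beta_1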
Special matrices

Diagonal matrix: square matrix with off-diagonal elements equal to 0
Identity matrix I: diagonal matrix with diagonal elements equal to 1
Scalar matrix: diagonal matrix with diagonal elements being the same, λI

1: a column vector with all elements equal to 1; J: a square matrix with all elements equal to 1; 0: a column vector with all elements equal to 0

1′1 = n; 11′ = J
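These special matrices are easy to build in NumPy; a brief sketch (n = 4 is arbitrary) that also verifies 1′1 = n and 11′ = J.

import numpy as np

n = 4
I = np.eye(n)                        # identity matrix
D = np.diag([1.0, 2.0, 3.0, 4.0])    # diagonal matrix
S = 2.5 * np.eye(n)                  # scalar matrix, lambda * I
one = np.ones((n, 1))                # column vector of 1s
J = np.ones((n, n))                  # square matrix of 1s
zero = np.zeros((n, 1))              # column vector of 0s

print((one.T @ one).item())          # 1'1 = n = 4
print(np.allclose(one @ one.T, J))   # 11' = J -> True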
Rank of matrix
The rank of a matrix is defined to be the maximum number of
linearly independent columns in the matrix.
Some useful facts:

Rank(AB) ≤ min(Rank(A), Rank(B));
Rank(A) = Rank(A′) ≤ min(n, p);
Rank(A′A) = Rank(AA′) = Rank(A),

where (n, p) are the number of rows and columns of A.

The p columns of A are called linearly dependent if there exist λ_1, · · · , λ_p, not all zero, such that

λ_1 A_1 + λ_2 A_2 + · · · + λ_p A_p = 0;

otherwise they are linearly independent, where A_i is the i-th column of A.
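The rank facts above can be checked numerically with np.linalg.matrix_rank; a minimal sketch on small matrices invented for the example:

import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])           # 4 x 2, columns are linearly independent
B = np.array([[1.0, 2.0],
              [2.0, 4.0]])           # 2 x 2, second column = 2 * first, so rank 1

rank = np.linalg.matrix_rank
print(rank(A))                       # 2 = min(n, p)
print(rank(A) == rank(A.T))          # True
print(rank(A.T @ A), rank(A @ A.T))  # both equal Rank(A) = 2
print(rank(A @ B) <= min(rank(A), rank(B)))  # True (here Rank(AB) = 1)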
Matrix inverse

The inverse of A is another matrix, denoted by A⁻¹, such that

AA⁻¹ = A⁻¹A = I.

A⁻¹ exists if Rank(A) equals the number of rows or columns, in which case A is said to be invertible, or nonsingular, or of full rank.

A simple example of a matrix inverse:

A = \begin{bmatrix} a & b \\ c & d \end{bmatrix}, \qquad A⁻¹ = \frac{1}{D} \begin{bmatrix} d & −b \\ −c & a \end{bmatrix},

where D = ad − bc is the determinant of A.


X′X = \begin{bmatrix} n & \sum_{i=1}^{n} x_i \\ \sum_{i=1}^{n} x_i & \sum_{i=1}^{n} x_i^2 \end{bmatrix}, \qquad D = n \sum_{i=1}^{n} (x_i − X̄)²

(X′X)⁻¹ = \begin{bmatrix} \frac{1}{n} + \frac{X̄²}{\sum_{i=1}^{n}(x_i − X̄)²} & \frac{−X̄}{\sum_{i=1}^{n}(x_i − X̄)²} \\ \frac{−X̄}{\sum_{i=1}^{n}(x_i − X̄)²} & \frac{1}{\sum_{i=1}^{n}(x_i − X̄)²} \end{bmatrix}
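A NumPy sketch (the x values are made up) that checks the closed-form inverse of X′X against the numerical inverse, and verifies AA⁻¹ = I and D = n Σ(x_i − X̄)²:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
n = len(x)
X = np.column_stack([np.ones(n), x])

XtX = X.T @ X                          # [[n, sum x], [sum x, sum x^2]]
Sxx = np.sum((x - x.mean()) ** 2)      # sum of (x_i - xbar)^2

closed_form = np.array([
    [1.0 / n + x.mean() ** 2 / Sxx, -x.mean() / Sxx],
    [-x.mean() / Sxx,               1.0 / Sxx],
])

XtX_inv = np.linalg.inv(XtX)                     # numerical inverse
print(np.allclose(XtX_inv, closed_form))         # True
print(np.allclose(XtX @ XtX_inv, np.eye(2)))     # A A^{-1} = I
print(np.isclose(np.linalg.det(XtX), n * Sxx))   # determinant D = n * Sxx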
Some basic results for matrices

A + B = B + A; (A + B) + C = A + (B + C)
(AB)C = A(BC); C(A + B) = CA + CB; k(A + B) = kA + kB
(A′)′ = A; (A + B)′ = A′ + B′; (AB)′ = B′A′; (ABC)′ = C′B′A′
(A⁻¹)⁻¹ = A; (AB)⁻¹ = B⁻¹A⁻¹
(ABC)⁻¹ = C⁻¹B⁻¹A⁻¹; (A′)⁻¹ = (A⁻¹)′
Random vector and matrix

A random vector or a random matrix contains elements that are random variables.
Its expectation is defined as the element-wise expectation,

E(A) = [E(A_{ij})]

The variance-covariance matrix of a random vector

Y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}

is defined as

Var(Y) = \begin{bmatrix}
Var(y_1) & Cov(y_1, y_2) & \cdots & Cov(y_1, y_n) \\
Cov(y_2, y_1) & Var(y_2) & \cdots & Cov(y_2, y_n) \\
\vdots & \vdots & & \vdots \\
Cov(y_n, y_1) & Cov(y_n, y_2) & \cdots & Var(y_n)
\end{bmatrix},

which is symmetric because Cov(yi , yj ) = Cov(yj , yi ).

Suppose A is a constant matrix and define W = AY. We have

E(A) = A
E(W) = E(AY) = A E(Y)
Cov(W) = Var(AY) = A Var(Y) A′.
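The rule Cov(W) = A Var(Y) A′ can be illustrated by simulation. A minimal sketch with an arbitrary constant matrix A and Y having i.i.d. components with variance σ² = 4, so that Var(Y) = 4I:

import numpy as np

rng = np.random.default_rng(0)
A = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, -1.0]])      # constant 2 x 3 matrix
sigma2 = 4.0
var_Y = sigma2 * np.eye(3)            # Var(Y) for i.i.d. components

Y = rng.normal(0.0, np.sqrt(sigma2), size=(3, 100_000))  # many draws of Y
W = A @ Y                             # each column is one draw of W = A Y

print(np.cov(W))                      # empirical Var(W)
print(A @ var_Y @ A.T)                # theoretical A Var(Y) A', approx. equal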
Derivatives of matrix functions

Consider the linear function of the parameters

β = (β_1, β_2, · · · , β_p)′

in matrix form,

F(β) = A_1′β = β′A_1;  A_1 = (a_{11}, a_{21}, · · · , a_{p1})′

It is obvious that

∂F(β)/∂β_i = a_{i1}.

So the derivative of F(β) w.r.t. β is

∂F(β)/∂β = (∂F(β)/∂β_1, ∂F(β)/∂β_2, · · · , ∂F(β)/∂β_p)′ = A_1

Consider the following quadratic function of β,

Q(β) = β′Qβ;  Q = (q_{ij})_{i=1,··· ,p; j=1,··· ,p},

where the terms involving β_i are

q_{ii} β_i² + \sum_{j≠i} (q_{ij} + q_{ji}) β_j β_i.

Therefore,

∂Q(β)/∂β_i = 2q_{ii} β_i + \sum_{j≠i} (q_{ij} + q_{ji}) β_j = \sum_{j=1}^{p} (q_{ij} + q_{ji}) β_j,

and hence we have the vector of derivatives

∂Q(β)/∂β = (∂Q(β)/∂β_1, ∂Q(β)/∂β_2, · · · , ∂Q(β)/∂β_p)′ = (Q + Q′)β.

For any quadratic function, we can always write it using a symmetric matrix

Q̃ = \frac{1}{2}(Q + Q′);  Q(β) = β′Q̃β;  ∂Q(β)/∂β = 2Q̃β.

In regression analysis, the symmetric set-up is more useful.
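The gradient formula (Q + Q′)β can be sanity-checked against finite differences. A small sketch with an arbitrary non-symmetric Q and a made-up β; it also confirms that the symmetric version Q̃ gives the same quadratic form.

import numpy as np

Q = np.array([[2.0, 1.0, 0.0],
              [0.0, 3.0, 4.0],
              [1.0, 0.0, 1.0]])       # non-symmetric on purpose
beta = np.array([0.5, -1.0, 2.0])

def quad(b):
    return b @ Q @ b                  # quadratic form beta' Q beta

analytic = (Q + Q.T) @ beta           # vector of derivatives from the slide

eps = 1e-6
numeric = np.array([(quad(beta + eps * e) - quad(beta - eps * e)) / (2 * eps)
                    for e in np.eye(3)])
print(np.allclose(analytic, numeric, atol=1e-5))      # True

Q_tilde = 0.5 * (Q + Q.T)
print(np.isclose(quad(beta), beta @ Q_tilde @ beta))  # same quadratic form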
Simple linear regression in matrix form

The simple linear regression model

y_i = β_0 + x_i β_1 + e_i;  i = 1, · · · , n,

can be compactly written in matrix form,

Y = E(Y) + e = Xβ + e,

where

Y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} (response vector), \quad X = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix} (design matrix), \quad β = \begin{bmatrix} β_0 \\ β_1 \end{bmatrix} (coefficient vector), \quad e = \begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{bmatrix} (error vector),

and e ∼ N(0_n, σ²I).


Least squares estimation

Our goal is to minimize the squared distance

Q(β) = ∥Y − Xβ∥² = (Y − Xβ)′(Y − Xβ)
     = β′X′Xβ − β′X′Y − Y′Xβ + Y′Y
     = β′X′Xβ − 2β′X′Y + Y′Y.

Then we have

∂Q(β)/∂β = 2X′Xβ − 2X′Y = 2X′(Xβ − Y).

So the OLS estimate β̂ = [β̂_0, β̂_1]′ satisfies the following normal equation,

X′(Xβ̂ − Y) = 0,
which says the residuals

ê = Y − Xβ̂

are orthogonal to the columns of X, i.e.

1′ê = \sum_{i=1}^{n} ê_i = 0;  \sum_{i=1}^{n} x_i ê_i = 0.

We can also derive the solution in matrix form as

β̂ = (X′X)⁻¹X′Y.

(What if X′X is not invertible?)
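A minimal NumPy sketch of the normal-equation solution, on data simulated just for illustration. In practice one solves the normal equations (or uses lstsq) rather than forming (X′X)⁻¹ explicitly; and if X′X is not invertible (e.g., perfectly collinear columns), lstsq or pinv still return a least-squares solution, although it is no longer unique.

import numpy as np

rng = np.random.default_rng(1)
n = 50
x = rng.uniform(0, 10, n)
y = 2.0 + 0.5 * x + rng.normal(0, 1.0, n)      # simulated data: beta0 = 2, beta1 = 0.5

X = np.column_stack([np.ones(n), x])

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # solves the normal equations
print(beta_hat)

# Equivalent, and more robust when X'X is singular or ill-conditioned:
print(np.linalg.lstsq(X, y, rcond=None)[0])

# The residuals are orthogonal to the columns of X:
e_hat = y - X @ beta_hat
print(X.T @ e_hat)                             # both entries are ~0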


OLS estimates

Unbiasedness:

E(β̂) = (X′X)⁻¹X′E(Y) = (X′X)⁻¹X′Xβ = β

Variance:

Var(β̂) = (X′X)⁻¹X′ Cov(Y) X(X′X)⁻¹ = (X′X)⁻¹X′ σ²I X(X′X)⁻¹ = σ²(X′X)⁻¹

Moreover, β̂ = (X′X)⁻¹X′Y is a linear transformation of Y ∼ N(Xβ, σ²I), so

β̂ ∼ N(β, σ²(X′X)⁻¹)
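A short simulation sketch (true β, σ, and x grid chosen arbitrarily) illustrating that β̂ is unbiased with variance close to σ²(X′X)⁻¹:

import numpy as np

rng = np.random.default_rng(2)
n, sigma = 30, 1.5
x = np.linspace(0, 10, n)
X = np.column_stack([np.ones(n), x])
beta = np.array([1.0, 0.8])                    # true coefficients

reps = 20_000
estimates = np.empty((reps, 2))
for r in range(reps):
    y = X @ beta + rng.normal(0, sigma, n)
    estimates[r] = np.linalg.solve(X.T @ X, X.T @ y)

print(estimates.mean(axis=0))                  # approx. beta
print(np.cov(estimates, rowvar=False))         # approx. sigma^2 (X'X)^{-1}
print(sigma ** 2 * np.linalg.inv(X.T @ X))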
Fitted values

The fitted values are

Ŷ = \begin{bmatrix} ŷ_1 \\ ŷ_2 \\ \vdots \\ ŷ_n \end{bmatrix} = Xβ̂ = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix} \begin{bmatrix} β̂_0 \\ β̂_1 \end{bmatrix}.

Then we have

Ŷ = Xβ̂ = X(X′X)⁻¹X′Y,

where H = X(X′X)⁻¹X′ is the hat matrix or projection matrix.
It is obvious that H is symmetric and

HH = H.

In general, a matrix A is said to be idempotent if AA = A.

Also, we have

E(Ŷ) = H E(Y) = HXβ = X(X′X)⁻¹X′Xβ = Xβ
Var(Ŷ) = H Cov(Y) H′ = H σ²I H = σ²H

Therefore, it follows that

Ŷ = HY ∼ N(Xβ, σ²H)
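A short sketch, with a small invented data set, that forms the hat matrix and checks that it is symmetric and idempotent and that Ŷ = HY:

import numpy as np

x = np.array([1.0, 2.0, 4.0, 5.0, 7.0])
y = np.array([2.0, 2.9, 5.1, 6.0, 8.2])
X = np.column_stack([np.ones(len(x)), x])

H = X @ np.linalg.inv(X.T @ X) @ X.T           # hat (projection) matrix

print(np.allclose(H, H.T))                     # symmetric
print(np.allclose(H @ H, H))                   # idempotent: HH = H

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(X @ beta_hat, H @ y))        # fitted values Y_hat = H Y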
Residuals

Similarly, we can write the residuals as

ê = Y − Ŷ = Y − HY = (I − H)Y.

The matrix I − H is also symmetric and idempotent, and

E(ê) = (I − H)E(Y) = (I − H)Xβ = 0
Var(ê) = (I − H) Cov(Y)(I − H)′ = σ²(I − H)

Therefore, it follows that

ê = (I − H)Y ∼ N(0, σ²(I − H))


Residuals

The estimated error variance is

σ̂² = \frac{1}{n−2} \sum_{i=1}^{n} ê_i² = \frac{ê′ê}{n−2}

It can be verified that, writing tr(·) for the trace (the sum of the diagonal elements),

E(ê′ê) = E(tr(ê′ê)) = E(tr(ê ê′))
       = tr(E(ê ê′)) = tr(Var(ê) + E(ê)E(ê)′)
       = tr(σ²(I − H)) = σ² tr(I − H)
       = σ²(n − 2).

The second line uses E(ê ê′) = Var(ê) + E(ê)E(ê)′, the random-vector version of the familiar identity E(x²) = Var(x) + (E x)², and the last step uses tr(I − H) = n − tr(H) = n − 2, since tr(H) = tr(X(X′X)⁻¹X′) = tr((X′X)⁻¹X′X) = tr(I_2) = 2.

Therefore, σ̂ 2 is an unbiased estimate of σ 2
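Continuing the same kind of sketch with invented data: the estimated error variance ê′ê/(n − 2), together with a numerical check that tr(I − H) = n − 2.

import numpy as np

x = np.array([1.0, 2.0, 4.0, 5.0, 7.0, 8.0])
y = np.array([1.8, 3.1, 4.9, 6.2, 8.0, 9.1])
n = len(x)
X = np.column_stack([np.ones(n), x])

H = X @ np.linalg.inv(X.T @ X) @ X.T
e_hat = (np.eye(n) - H) @ y                    # residuals (I - H) Y

sigma2_hat = (e_hat @ e_hat) / (n - 2)         # unbiased estimate of sigma^2
print(sigma2_hat)
print(np.trace(np.eye(n) - H))                 # equals n - 2 = 4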


Analysis of variance

Recall that J = 11′. We have

Y′Y = \sum_{i=1}^{n} y_i², \quad Y′1 = \sum_{i=1}^{n} y_i, \quad Y′JY = Y′11′Y = \Big( \sum_{i=1}^{n} y_i \Big)².

Therefore,

SS_{tot} = \sum_{i=1}^{n} y_i² − \Big( \sum_{i=1}^{n} y_i \Big)² / n = Y′Y − \frac{1}{n} Y′JY

SS_{res} = ê′ê = Y′(I − H)Y = Y′Y − Y′HY
         = ê′(Y − Xβ̂) = ê′Y = (Y − Xβ̂)′Y = Y′Y − β̂′X′Y

SS_{reg} = SS_{tot} − SS_{res} = β̂′X′Y − \frac{1}{n} Y′JY = Y′HY − \frac{1}{n} Y′JY.
We thus have a unified formula for these three sums of squares,

Y′AY,

where A is

SSTO: I − \frac{1}{n} J;
SSE: I − H;
SSR: H − \frac{1}{n} J,

so they are all quadratic functions of Y.
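A NumPy sketch (again with invented data) computing the three sums of squares as quadratic forms Y′AY and checking that SSTO = SSE + SSR:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
Y = np.array([1.2, 2.1, 2.8, 4.2, 4.9, 6.3])
n = len(Y)
X = np.column_stack([np.ones(n), x])

H = X @ np.linalg.inv(X.T @ X) @ X.T
J = np.ones((n, n))
I = np.eye(n)

SSTO = Y @ (I - J / n) @ Y
SSE = Y @ (I - H) @ Y
SSR = Y @ (H - J / n) @ Y

print(SSTO, SSE, SSR)
print(np.isclose(SSTO, SSE + SSR))             # True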
Statistical inferences

As shown in a previous slide,

Var(β̂) = σ²(X′X)⁻¹ = σ² \begin{bmatrix} \frac{1}{n} + \frac{x̄²}{\sum_{i=1}^{n}(x_i − x̄)²} & \frac{−x̄}{\sum_{i=1}^{n}(x_i − x̄)²} \\ \frac{−x̄}{\sum_{i=1}^{n}(x_i − x̄)²} & \frac{1}{\sum_{i=1}^{n}(x_i − x̄)²} \end{bmatrix}.

Replacing σ² with σ̂², we can derive the estimated variance.

The mean response at x_* can be written as

ŷ_* = X_*′β̂, where X_* = [1  x_*]′,

so

Var(ŷ_*) = X_*′ Var(β̂) X_* = σ² X_*′(X′X)⁻¹X_* = σ² \Big( \frac{1}{n} + \frac{(x_* − x̄)²}{\sum_{i=1}^{n}(x_i − x̄)²} \Big).

Replacing σ² with σ̂², we can derive the estimated variance

se_{fit}²(ŷ_*) = σ̂² \Big( \frac{1}{n} + \frac{(x_* − x̄)²}{\sum_{i=1}^{n}(x_i − x̄)²} \Big).

When predicting a new observation, we have

Var(pred) = σ² + Var(ŷ_*) = σ² (1 + X_*′(X′X)⁻¹X_*),

and its estimate, with σ² replaced by the MSE,

se_{pred}²(ỹ_*) = σ̂² (1 + X_*′(X′X)⁻¹X_*).
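A closing sketch, with invented data and an arbitrary x_*, computing the estimated standard errors for the mean response and for predicting a new observation from the matrix formulas above:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
y = np.array([2.2, 2.9, 4.1, 4.8, 6.2, 6.8, 8.1])
n = len(x)
X = np.column_stack([np.ones(n), x])

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
e_hat = y - X @ beta_hat
sigma2_hat = (e_hat @ e_hat) / (n - 2)         # MSE

x_star = 4.5
X_star = np.array([1.0, x_star])
y_star_hat = X_star @ beta_hat                 # estimated mean response at x_star

se_fit = np.sqrt(sigma2_hat * X_star @ XtX_inv @ X_star)          # mean response
se_pred = np.sqrt(sigma2_hat * (1 + X_star @ XtX_inv @ X_star))   # new observation

print(y_star_hat, se_fit, se_pred)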
