
System Identification and Control (BI3SI)

William Harwin
based on notes by X. Hong and W.S. Harwin
Lecture 1: Introduction
Books:
• Ljung, L. and T. Söderström (1983): Theory and Practice of Recursive Identification, The MIT Press [1].
• Ljung, Lennart (1987): System Identification: Theory for the User, Prentice Hall [2], [3].
• Söderström, T. and P. Stoica (1989): System Identification, Prentice Hall [4].
• Åström, K. J. and B. Wittenmark (1989): Adaptive Control, Addison Wesley.

System Identification:
Assessment is via examination and an assignment (see blackboard)
There may be additional information on the websites http://www.personal.rdg.ac.uk/~sis01xh/
http://www.reading.ac.uk/~shshawin/LN

Introduction to system identification
Systems and models
Systems are objects whose behaviour we would like to study, control and affect. Typical tasks arise in the study and use of systems, e.g.
• A ship: We would like to control its direction by manipulating the rudder angle.
• A telephone channel: We would like to construct a receiver with good reproduction of the transmitted signal.
• A time series of data (e.g. sales): We would like to predict the future values.
A model is the knowledge of the properties of a system. It is necessary to have a model of the system in order to solve problems such as control, signal processing and system design. Sometimes the aim of modelling is to aid in design. Sometimes a simple model is useful for explaining certain observed phenomena.
A model may be given in any one of the following forms:
1. Mental, intuitive or verbal models: Knowledge of the system’s behavior is summarized in a person’s mind.
2. Graphs and tables: e.g. the step response, Bode plot.
3. Mathematical models: Here we confine this class of model to differential and difference equations.
There are two ways of constructing mathematical models:
1. Physical modelling: Basic laws from physics such as Newton’s laws and balance equations are used to describe the
dynamical behavior of a process. Yet in many cases the processes are so complex that it is not possible to obtain
reasonable models using only physical insight, so we are forced to use the alternative:
2. System identification: Some experiments are performed on the system; A model is then fitted to the recorded data
by assigning suitable values to its parameters.
System identification is the field of modeling dynamic systems from experimental data. A dynamical system can be
conceptually described by the following Figure 1.

Figure 1: A dynamical system with input u(t), output y(t) and disturbance v(t), where t denotes time.

The system is controlled by input u(t). The user cannot control v(t). The output y(t) can be measured and gives
information about the system. For a dynamical system the control action at time t will influence the output at time
instants s > t.

How system identification is applied


1. Design of experiments.
2. Perform experiments, and collect data.

3. Determine/choose model structure.
4. Estimate model parameters.
5. Model validation.
In practice the procedure is iterated until an acceptable model is found. A good model should encompass essential
information without becoming too complex.

Figure 2: System identification and computer control

Shift-operator calculus
The forward-shift operator
q f(k) = f(k + 1)   or   q f_k = f_{k+1}    (1)
The backward-shift operator
q^{-1} f(k) = f(k - 1)   or   q^{-1} f_k = f_{k-1}    (2)
The shift operator is used to simplify the manipulation of higher-order difference equations. Consider the equation

y(k) + a_1 y(k-1) + ... + a_{na} y(k-n_a) = b_0 u(k-d) + b_1 u(k-d-1) + ... + b_{nb} u(k-d-n_b)    (3)

The model can be expressed as


A(q^{-1}) y(k) = B(q^{-1}) u(k)    (4)
where
A(q^{-1}) = a_0 + a_1 q^{-1} + ... + a_{na} q^{-na},  with a_0 = 1    (5)
B(q^{-1}) = (b_0 + b_1 q^{-1} + ... + b_{nb} q^{-nb}) q^{-d}    (6)
These discrete representations are thus linear and time invariant.
• Linear because scaling the input by a real number α scales the output identically: the response to αu is αy.
• Time invariant because shifting the input in time shifts the output by the same amount: the response to q^n u is q^n y for any integer n (the process does not change on different days).

Z-transform
Given a signal x(t) sampled every T seconds then the z-transform is

X(z) = Z[x(kT)] = \sum_{k=0}^{\infty} x(kT) z^{-k}
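For a concrete feel for (4), a difference equation written with the shift operator can be simulated directly with Matlab's filter function; the polynomial coefficients below are a minimal sketch with assumed illustrative values, not values from the notes.

% simulate A(q^{-1}) y(k) = B(q^{-1}) u(k) with assumed A = 1 - 0.8 q^{-1}, B = q^{-1}
a = [1 -0.8];              % coefficients of A(q^{-1})
b = [0 1];                 % coefficients of B(q^{-1}); the leading zero gives the one-step delay
u = 2-4*rand(1,100);       % zero mean uniform input
y = filter(b,a,u);         % y(k) = 0.8*y(k-1) + u(k-1)
plot(y); xlabel('k'); ylabel('y(k)');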

Lecture 2: Least squares method


Positive definite matrices
Consider the following function f , defined with a matrix M and input vector x

f(x) = x^T M x

Figure 3: The function f(x) = x^T M x is always positive, therefore M is positive definite.

That is, f is a quadratic function of x.


If f is always positive irrespective of x (and zero only when x = [0, 0]^T) then M is positive definite. If x is a two-element vector x = [x_1, x_2]^T, then f looks like a bowl resting on the origin, as seen in Figure 3, for a positive definite M.
Two other shapes can result from the quadratic form. If f is always negative then M is known as negative definite. If f is positive in some regions and negative in others then it describes a saddle. In all cases f is zero when x = [0, 0]^T.

Figure 4: Left: Matrix M is negative definite. Right: Matrix M defines a saddle.

Theorem: If a matrix M = φ^T φ (with φ of full column rank), then it is positive definite.


Proof. The quadratic form is

f(x) = x^T M x = x^T φ^T φ x

If we can show that f is always positive then M must be positive definite.

f(x) = (φx)^T φx

Denote z = φx, so

f(x) = z^T z = \sum_i z_i^2

so f is always positive except when φx = 0, i.e. (for full column rank φ) only when x = [0, 0]^T.
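A quick numerical check of this theorem (a sketch; the particular φ below is an arbitrary random choice):

phi = randn(6,2);          % any tall matrix with full column rank (illustrative)
M = phi'*phi;              % M = phi^T phi should be positive definite
eig(M)                     % eigenvalues of M are all positive
x = randn(2,1);            % an arbitrary test vector
f = x'*M*x                 % quadratic form is positive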

Linear regression

Figure 5: Two models fitted with least squares linear regression.

The simplest parametric model is linear regression.


y(t) = ŷ(t) + e(t) = φ^T(t)θ + e(t)

where y(t) is a measurable quantity, φ(t) is a vector of known quantities, θ is a vector of unknown parameters, and e(t) is the modelling error. t denotes the label for the data samples.
Given a model form and a data set, we wish to calculate the parameter θ. One way is to minimise the errors between
the actual y(t) and the model prediction ŷ(t).

Example 2-1:

y(t) = a u(t) + b + e(t)

θ = [a, b]^T,  φ(t) = [u(t), 1]^T.

  t    |  1   2   3   4   5
  u(t) |  1   2   4   5   7
  y(t) |  1   3   5   7   8

ŷ = [ŷ(1); ŷ(2); ŷ(3); ŷ(4); ŷ(5)] = [1 1; 2 1; 4 1; 5 1; 7 1] θ = φθ

Example 2-2:

y(t) = b_2 u^2(t) + b_1 u(t) + b_0 + e(t)

θ = [b_2, b_1, b_0]^T,  φ(t) = [u^2(t), u(t), 1]^T.

ŷ = [ŷ(1); ŷ(2); ... ; ŷ(N-1); ŷ(N)]
  = [u^2(1) u(1) 1; u^2(2) u(2) 1; ... ; u^2(N-1) u(N-1) 1; u^2(N) u(N) 1] θ
  = φθ

Example 2-3:

y(t) = b_2 u(t) + b_1 u(t-1) + b_0 u(t-2) + e(t)

θ = [b_2, b_1, b_0]^T,  φ(t) = [u(t), u(t-1), u(t-2)]^T.

ŷ = [ŷ(1); ŷ(2); ... ; ŷ(N-1); ŷ(N)]
  = [u(1) 0 0; u(2) u(1) 0; ... ; u(N-1) u(N-2) u(N-3); u(N) u(N-1) u(N-2)] θ
  = φθ

Denote the actual output vector

y = [y(1), y(2), ..., y(N)]^T.

A model can be estimated by using y (the actual measurement of the output) as the target for ŷ (the output expected by the model).

Derivation of the least squares algorithm
For t = 1, ..., N ,
e(t) = y(t) − ŷ(t)
In vector form

e = [e(1); e(2); ... ; e(N-1); e(N)] = y - ŷ

SSE (sum of squared errors) = \sum_{t=1}^{N} e^2(t) = e^T e

The model parameter vector θ is derived so that the distance between these two vectors is the smallest possible, i.e. the SSE is minimized.

e^T e = (y - ŷ)^T (y - ŷ)
      = (y^T - ŷ^T)(y - ŷ)          ←  ŷ = φθ,  (φθ)^T = θ^T φ^T
      = (y^T - θ^T φ^T)(y - φθ)
      = y^T y - 2θ^T φ^T y + θ^T φ^T φθ
      = y^T y - y^T φ[φ^T φ]^{-1} φ^T y + y^T φ[φ^T φ]^{-1} φ^T y - 2θ^T φ^T y + θ^T φ^T φθ
        (adding and subtracting y^T φ[φ^T φ]^{-1} φ^T y, which contributes zero)
      = y^T y - y^T φ[φ^T φ]^{-1} φ^T y + (θ - [φ^T φ]^{-1} φ^T y)^T φ^T φ (θ - [φ^T φ]^{-1} φ^T y)

Note that only the last term contains θ. The SSE is minimal when the last term is minimised.

e^T e = y^T y - y^T φ[φ^T φ]^{-1} φ^T y + (θ - [φ^T φ]^{-1} φ^T y)^T φ^T φ (θ - [φ^T φ]^{-1} φ^T y)
      = y^T y - y^T φ[φ^T φ]^{-1} φ^T y + x^T φ^T φ x

where x = θ - [φ^T φ]^{-1} φ^T y.
The SSE has the minimum value y^T y - y^T φ[φ^T φ]^{-1} φ^T y when

x = θ - [φ^T φ]^{-1} φ^T y = 0   (the zero vector)

The solution of the least squares parameter estimate is

θ̂ = [φ^T φ]^{-1} φ^T y

where θ̂ means the estimate of θ.

Example 2-4:
For Example 2-1, y = [1, 3, 5, 7, 8]^T, and

φ = [1 1; 2 1; 4 1; 5 1; 7 1]

θ̂ = [φ^T φ]^{-1} φ^T y
   = ( [1 2 4 5 7; 1 1 1 1 1] [1 1; 2 1; 4 1; 5 1; 7 1] )^{-1} [1 2 4 5 7; 1 1 1 1 1] [1; 3; 5; 7; 8]
   = [95 19; 19 5]^{-1} [118; 24]
   = [1.1754; 0.3333]

and the model of the least squares fit is

ŷ(t) = 1.1754 u(t) + 0.3333
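The same numbers can be obtained with a few lines of Matlab (a sketch; the backslash operator solves the identical least squares problem):

u = [1 2 4 5 7]';
y = [1 3 5 7 8]';
Phi = [u ones(size(u))];            % regression matrix for Example 2-1
theta = (Phi'*Phi)\(Phi'*y)         % normal equations: [1.1754; 0.3333]
theta_backslash = Phi\y             % Matlab's backslash gives the same answer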

An alternative proof:
SSE = e^T e = y^T y - 2θ^T φ^T y + θ^T φ^T φθ

SSE has its minimum when

∂(SSE)/∂θ = 0

∂(SSE)/∂θ = -2[∂(y^T φθ)/∂θ]^T + ∂(θ^T φ^T φθ)/∂θ
          = -2φ^T y + 2φ^T φθ
          = -2φ^T (y - φθ) = 0

yielding

θ̂ = [φ^T φ]^{-1} φ^T y

Lecture 3: Model representations for dynamical systems
Consider a dynamical system with input signal {u(t)} and output signal {y(t)}. Suppose that these signals are sampled
in discrete time t = 1, 2, 3... and the sample values can be related through a linear difference equation

y(t) + a_1 y(t-1) + ... + a_{na} y(t-n_a) = b_1 u(t-1) + b_2 u(t-2) + ... + b_{nb} u(t-n_b) + v(t)

where v(t) is some disturbance.


Let q −1 be the backward shift (or delay) operator: q −1 y(t) = y(t − 1).
Then the system can be rewritten as

A(q −1 )y(t) = B(q −1 )u(t) + v(t)

where
A(q −1 ) = 1 + a1 q −1 + ... + ana q −na
B(q −1 ) = b1 q −1 + ... + bnb q −nb
The system can be expressed as a linear regression

y(t) = φT (t)θ + v(t)

with
θ = [a1 , ..., ana , b1 , ..., bnb ]T

φ(t) = [−y(t − 1), ..., −y(t − na ), u(t − 1), ..., u(t − nb )]T
If {v(t)} is a sequence of independent random variables with zero mean values, we call it 'white noise'.
If no input is present and {v(t)} is white, then the system becomes

y(t) + a1 y(t − 1) + ... + ana y(t − na ) = v(t)

which is known as an autoregressive (AR) processes of order na .

The ARMAX Model


See the Acronyms list at the end of these notes for the expansion of ARMAX, or read on.
Suppose the disturbance is described as a moving average (MA) of a white noise sequence {e(t)}:

v(t) = C(q −1 )e(t)

C(q −1 ) = 1 + c1 q −1 + ... + cnc q −nc


Then the resultant model is
A(q −1 )y(t) = B(q −1 )u(t) + C(q −1 )e(t)
known as an ARMAX model. The model is a combination of an autoregressive (AR) part

A(q −1 )y(t),

a moving average (MA) part


C(q −1 )e(t),
and a control part
B(q −1 )u(t),
with u(t) being referred to as an eXogeneous variable.
When there is no input, the model is
A(q −1 )y(t) = C(q −1 )e(t)
which is known as an ARMA process, a very common type of model for stochastic signals.
Another variant is
y(t) = B(q −1 )u(t) + v(t)
referred to as a finite impulse response (FIR) model.
The ARMAX Model is a linear regression
y(t) = φT (t)θ + e(t)
with
θ = [a1 , ..., ana , b1 , ..., bnb , c1 , ...cnc ]T
φ(t) = [−y(t − 1), ..., −y(t − na ), u(t − 1)..., u(t − nb ), e(t − 1), ..., e(t − nc )]T

Basic examples of ARMAX Systems
The basic examples to represent the systems are given by
y(t) + ay(t − 1) = bu(t − 1) + ce(t − 1) + e(t)
where {e(t)} is a white noise sequence with variance σ 2 .
Suppose that two sets of parameters are used:
System 1: a = −0.8, b = 1.0, c = 0.0, σ = 1.0
y(t) − 0.8y(t − 1) = u(t − 1) + e(t)
System 2: a = −0.8, b = 1.0, c = −0.8, σ = 1.0
x(t) − 0.8x(t − 1) = u(t − 1)
y(t) = x(t) + e(t)

Figure 6: Left: System 1. Right: System 2.

Example 3-1
System 1 is simulated generating 1000 data points. The input u(t) is a uniformly distributed random signal in [-2,2].

Matlab code generating data (system 1):


>> u=2-4*rand(1,1000); % zero mean uniform distribution
>> e=randn(1,1000);
>> y(1)=0;
>> for i=2:1000;
>> y(i)=0.8*y(i-1) + u(i-1) + e(i);
>> end;
>> figure(1); plot(u);xlabel(’t’);ylabel(’u(t)’);
>> figure(2); plot(y);xlabel(’t’);ylabel(’y(t)’);
Suppose that a model in the form of
ŷ(t) = −ay(t − 1) + bu(t − 1)
is used to model the system. The least squares solution is
θ̂ = [φT φ]−1 φT y
based on y = [y(2), ..., y(1000)]^T, and

φ = [-y(1) u(1); -y(2) u(2); ... ; -y(998) u(998); -y(999) u(999)]
Matlab code for least squares estimator:
>> for i=2: 1000;
>> phi(i,:)=[y(i-1) u(i-1)];
>> end;
>> theta=(phi’*phi)\phi’*y’;
The model output is generated using
ŷ(t) = 0.8142y(t − 1) + 1.0413u(t − 1)

Figure 7: Upper: System 1 with a zero mean uniform signal in [-2, 2] as input. Lower: Output of the ARMAX process described by System 1.

Table 1: Parameter estimates for Example 3-1.

  Parameter | True value | Estimated value
  a         | -0.8       | -0.8142
  b         | 1.0        | 1.0413

Matlab code for model output:


>> yhat=phi*theta;
>> figure(3);
>> i=1:1000;
>> plot(i,y(i),’-’,i,yhat(i),’-.’);
>> xlabel(’t’);

Figure 8: The model parameters computed by the least squares algorithm are used to generate the model output and show good agreement with the system output.

Recursive identification
Suppose that at time instant (n - 1) we have already calculated θ̂, but a new output value y(n) becomes available at time n, so we need to recalculate θ̂.
We would like a way of efficiently recalculating the model each time we have new data. The ideal form would be

θ̂(n) = θ̂(n - 1) + correction factor

Thus if the model is correct at time (n - 1) and the new data at time n is consistent with the model then the correction factor would be close to zero.

Example 3-2
Estimation of a constant. Assume that a model is
y(t) = b
This means a constant is to be determined from a number of noisy measurements y(t), t = 1, ..., n.
θ = b, φ(t) = 1, y = [y(1), ..., y(n)]^T, and

φ = [1; 1; ... ; 1; 1]   (a column of n ones)

θ̂ = [φ^T φ]^{-1} φ^T y = (1/n) \sum_{t=1}^{n} y(t)

This is simply the mean of the signal.

If we introduce a label (n - 1) to represent the fact that n - 1 data points are used in deriving the mean, such that

θ̂(n - 1) = [φ^T(n - 1) φ(n - 1)]^{-1} φ^T(n - 1) y(n - 1) = (1/(n-1)) \sum_{t=1}^{n-1} y(t)

Thus for n data points,

θ̂(n) = (1/n) \sum_{t=1}^{n} y(t)
      = (1/n) { \sum_{t=1}^{n-1} y(t) + y(n) }
      = (1/n) { (n - 1) θ̂(n - 1) + y(n) }
      = (1/n) { n θ̂(n - 1) - θ̂(n - 1) + y(n) }
      = θ̂(n - 1) + (1/n) ( y(n) - θ̂(n - 1) )
      = θ̂(n - 1) + K(n) ( y(n) - θ̂(n - 1) )

with K(n) = 1/n. The above equation is the least squares algorithm in recursive form. The general form of recursive algorithms is

θ̂(n) = θ̂(n - 1) + K(n) ε(n)

where θ̂(n - 1) is the vector of model parameter estimates at time (n - 1),
θ̂(n) is the vector of model parameter estimates at time n,
ε(n) is the difference between the measured output y(n) and the model output using the previous model with parameter vector θ̂(n - 1),
and K(n) is the scaling, also known as the Kalman gain.
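A minimal sketch of this recursive mean estimator in Matlab (the constant b = 2 and the noise level are illustrative choices, not taken from the notes):

b = 2;                          % the constant to be estimated (illustrative)
y = b + randn(1,200);           % noisy measurements
theta = 0;                      % initial estimate
for n = 1:200
  K = 1/n;                                % gain
  theta = theta + K*(y(n) - theta);       % recursive update
end
theta                            % equals mean(y), i.e. close to b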

Advantages of recursive model estimation


• Gives an estimate of the model from the first time step.
• Can be computationally more efficient and less memory intensive, especially if we can avoid doing large matrix
inverse calculations.
• Can be made to adapt to a changing system, e.g. online system identification allows telephone systems to do echo
cancellation on long distance lines.
• Can be used for fault detection (model estimates start to differ radically from a norm).
• Forms the core of adaptive control strategies and adaptive signal processing.
• Ideal for real-time implementations.
Lecture 4: Recursive identification algorithm
In off-line or batch identification, data up to some t = N is first collected, then the model parameter vector θ̂ is
computed.
In on-line or recursive identification, the model parameter vector θ̂(n) is required for every t = n. It is important
that memory and computational time for updating the model parameter vector θ̂(n) do not increase with t. This poses
constraints on how the estimates may be computed.
General form of recursive algorithms is

θ̂(n) = θ̂(n − 1) + K(n)ε(n)

In the following we will derive the recursive form of the least squares algorithm by modifying the off-line algorithm.

Derivation of recursive least squares


Suppose we have all the data collected up to time n. Then consider the formation of the regression matrix Φ(n) as
Φ(n) = [φ^T(1); φ^T(2); ... ; φ^T(n)]

and output vector y(n) = [y(1), ..., y(n)]^T. Thus the least squares solution is

θ̂(n) = [Φ(n)T Φ(n)]−1 Φ(n)T y(n)

When a new data point arrives, n is increased by 1, and the least squares solution above requires repetitive calculation, including recalculating the inverse (expensive in computer time and storage).
Let's look at the expressions Φ(n)^T Φ(n) and Φ(n)^T y(n), and define P^{-1}(n) = Φ(n)^T Φ(n)

P^{-1}(n) = Φ(n)^T Φ(n)
          = [φ(1), φ(2), ..., φ(n)] [φ^T(1); φ^T(2); ... ; φ^T(n)]
          = \sum_{t=1}^{n} φ(t) φ^T(t)
          = \sum_{t=1}^{n-1} φ(t) φ^T(t) + φ(n) φ^T(n)
          = P^{-1}(n - 1) + φ(n) φ^T(n)

Φ(n)^T y(n) = [φ(1), φ(2), ..., φ(n)] [y(1); y(2); ... ; y(n)]
            = \sum_{t=1}^{n} φ(t) y(t)
            = \sum_{t=1}^{n-1} φ(t) y(t) + φ(n) y(n)
            = Φ(n - 1)^T y(n - 1) + φ(n) y(n)

P^{-1}(n) = P^{-1}(n - 1) + φ(n) φ^T(n)    (7)
Φ(n)^T y(n) = Φ(n - 1)^T y(n - 1) + φ(n) y(n)    (8)

The least squares estimate at time n is

θ̂(n) = [Φ(n)^T Φ(n)]^{-1} Φ(n)^T y(n)
      = P(n) Φ(n)^T y(n)
      = P(n) ( Φ(n - 1)^T y(n - 1) + φ(n) y(n) )    (9)

Because the least squares estimate at time (n - 1) is given by

θ̂(n - 1) = P(n - 1) Φ(n - 1)^T y(n - 1),

so P^{-1}(n - 1) θ̂(n - 1) = Φ(n - 1)^T y(n - 1)    (10)

Substituting (10) into (9):

θ̂(n) = P(n) ( P^{-1}(n - 1) θ̂(n - 1) + φ(n) y(n) )
      = P(n) ( P^{-1}(n) θ̂(n - 1) - φ(n) φ^T(n) θ̂(n - 1) + φ(n) y(n) )        ←  applying (7)
      = θ̂(n - 1) + P(n) φ(n) ( y(n) - φ^T(n) θ̂(n - 1) )
      = θ̂(n - 1) + K(n) ε(n)

Thus the recursive least squares (RLS) equations are

ε(n) = y(n) - φ^T(n) θ̂(n - 1)    (11)
P(n) = ( P^{-1}(n - 1) + φ(n) φ^T(n) )^{-1}    (12)
K(n) = P(n) φ(n)    (13)
θ̂(n) = θ̂(n - 1) + K(n) ε(n)    (14)

Yet we still require a matrix inverse to be calculated in (12). However this can be avoided through the matrix inversion lemma (to be derived next week).
Lecture 5: Derivation of recursive least squares (continued)
Matrix inversion lemma: If A, C and (C^{-1} + D A^{-1} B) are nonsingular (their inverses exist) then

[A + BCD]−1 = A−1 − A−1 B[C−1 + DA−1 B]−1 DA−1

The best way to prove this is to multiply both sides by [A + BCD].

[A + BCD][A^{-1} - A^{-1}B[C^{-1} + DA^{-1}B]^{-1}DA^{-1}]

  = I + BCDA^{-1} - B[C^{-1} + DA^{-1}B]^{-1}DA^{-1} - BCDA^{-1}B[C^{-1} + DA^{-1}B]^{-1}DA^{-1}

  = I + BCDA^{-1} - BCC^{-1}[C^{-1} + DA^{-1}B]^{-1}DA^{-1} - BCDA^{-1}B[C^{-1} + DA^{-1}B]^{-1}DA^{-1}
    (inserting CC^{-1} = I)

  = I + BCDA^{-1} - BC{C^{-1} + DA^{-1}B}[C^{-1} + DA^{-1}B]^{-1}DA^{-1}

  = I + BCDA^{-1} - BCDA^{-1}

  = I

Now look at the derivation of the RLS algorithm so far and consider applying the matrix inversion lemma to (16) below

ε(n) = y(n) - φ^T(n) θ̂(n - 1)    (15)
P(n) = ( P^{-1}(n - 1) + φ(n) φ^T(n) )^{-1}    (16)
K(n) = P(n) φ(n)    (17)
θ̂(n) = θ̂(n - 1) + K(n) ε(n)    (18)

with
A = P^{-1}(n - 1),  B = φ(n),  C = 1,  D = φ^T(n)

P(n) = ( P^{-1}(n - 1) + φ(n) φ^T(n) )^{-1}
     = A^{-1} - A^{-1}B[C^{-1} + DA^{-1}B]^{-1}DA^{-1}
     = P(n - 1) - P(n - 1) φ(n) [1 + φ^T(n) P(n - 1) φ(n)]^{-1} φ^T(n) P(n - 1)
     = P(n - 1) - ( P(n - 1) φ(n) φ^T(n) P(n - 1) ) / ( 1 + φ^T(n) P(n - 1) φ(n) )
The RLS algorithm is

ε(n) = y(n) - φ^T(n) θ̂(n - 1)
P(n) = P(n - 1) - ( P(n - 1) φ(n) φ^T(n) P(n - 1) ) / ( 1 + φ^T(n) P(n - 1) φ(n) )
K(n) = P(n) φ(n)
θ̂(n) = θ̂(n - 1) + K(n) ε(n)

Here the term ε(n) should be interpreted as a prediction error. It is the difference between the measured output y(n) and
the one-step-ahead prediction
ŷ(n|n − 1, θ̂(n − 1)) = φT (n)θ̂(n − 1)
made at time t = (n − 1). If ε(n) is small θ̂(n − 1) is good and should not be modified very much.
K(n) should be interpreted as a weighting factor showing how much the value of ε(n) will modify the different elements
of the parameter vector.
The algorithm also needs initial values of θ̂(0) and P(0). It is convenient to set the initial values of θ̂(0) to zeros and
the initial value of P(0) to LN × I, where I is the identity matrix and LN is a large number.
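These four equations translate directly into a Matlab loop. A sketch for a first order ARX model, assuming data vectors y and u of length N already exist; the initial values below (zeros for θ̂(0), a large number times the identity for P(0)) follow the recommendation above:

theta = zeros(2,1); P = 1e6*eye(2);              % theta(0) = 0, P(0) = LN*I
for n = 2:N
  phi   = [-y(n-1); u(n-1)];                     % regressor for a first order ARX model
  eps_n = y(n) - phi'*theta;                     % prediction error
  P     = P - (P*(phi*phi')*P)/(1 + phi'*P*phi); % covariance update (matrix inversion lemma)
  K     = P*phi;                                 % Kalman gain
  theta = theta + K*eps_n;                       % update estimate of [a; b]
end
theta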

Example 5-1:
(see Example 3-2) Estimation of a constant.
Assume that a model is
y(t) = b
This means a constant is to be determined from a number of noisy measurements y(t), t = 1, ..., n.
Since φ(n) = 1,

P(n) = P(n - 1) - ( P(n - 1) φ(n) φ^T(n) P(n - 1) ) / ( 1 + φ^T(n) P(n - 1) φ(n) )
     = P(n - 1) - P^2(n - 1) / ( 1 + P(n - 1) )
     = P(n - 1) / ( 1 + P(n - 1) )

i.e.

P^{-1}(n) = P^{-1}(n - 1) + 1 = n
K(n) = P(n) = 1/n

Thus θ̂(n) = θ̂(n - 1) + (1/n) ε(n), which coincides with the result of Example 3-2 in Lecture 3.

Example 5-2:
The input/output of a process is measured as in the following table. Suppose the process can be described by a model y(t) = a y(t - 1) + b u(t - 1) = φ^T(t)θ. If a recursive least squares algorithm is applied for on-line parameter estimation and at time t = 2, θ̂(2) = [0.8, 0.1]^T and P(2) = [1000 0; 0 1000], find θ̂(3) and P(3).

  t    |  1    2    3    4
  y(t) |  0.5  0.6  0.4  0.5
  u(t) |  0.3  0.4  0.5  0.7

Note φ(t) = [y(t - 1), u(t - 1)]^T, so φ(3) = [0.6, 0.4]^T.

ε(3) = y(3) - φ^T(3) θ̂(2)
     = 0.4 - [0.6, 0.4][0.8; 0.1] = -0.12

P(3) = P(2) - ( P(2) φ(3) φ^T(3) P(2) ) / ( 1 + φ^T(3) P(2) φ(3) )
     = [1000 0; 0 1000] - ( [600; 400][600, 400] ) / ( 1 + [0.6, 0.4][600; 400] )
     = [1000 0; 0 1000] - [360000 240000; 240000 160000] / 521
     = [309.0211 -460.6526; -460.6526 692.8983]

K(3) = P(3) φ(3)
     = [309.0211 -460.6526; -460.6526 692.8983][0.6; 0.4]
     = [1.1516; 0.7678]

θ̂(3) = θ̂(2) + K(3) ε(3)
      = [0.8; 0.1] + (-0.12)[1.1516; 0.7678]
      = [0.6618; 0.0079]
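These numbers can be checked with a few lines of Matlab (a sketch of the single update step; the variable names are illustrative):

theta = [0.8; 0.1];  P = 1000*eye(2);       % estimates after t = 2
phi   = [0.6; 0.4];  y3 = 0.4;              % phi(3) = [y(2); u(2)] and y(3)
eps3  = y3 - phi'*theta                     % prediction error, -0.12
P     = P - (P*(phi*phi')*P)/(1 + phi'*P*phi)
K     = P*phi
theta = theta + K*eps3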

RLS algorithm with a forgetting factor


The RLS algorithm can be modified for tracking time varying parameters. One approach is the RLS algorithm with a
forgetting factor. For a linear regression model
y(t) = φT (t)θ + e(t)
The loss function to be minimized in a least squares algorithm is

V(θ) = \sum_{t=1}^{n} [y(t) - φ^T(t)θ]^2

which we minimize with respect to θ. Consider modifying the loss function to

V(θ) = \sum_{t=1}^{n} λ^{(n-t)} [y(t) - φ^T(t)θ]^2

where λ < 1 is the forgetting factor, e.g. λ = 0.99 or 0.95. This means that as n increases, the measurements obtained previously are discounted. Older data has less effect on the coefficient estimation, hence it is "forgotten".
The RLS algorithm with a forgetting factor is

ε(n) = y(n) - φ^T(n) θ̂(n - 1)
P(n) = (1/λ) { P(n - 1) - ( P(n - 1) φ(n) φ^T(n) P(n - 1) ) / ( λ + φ^T(n) P(n - 1) φ(n) ) }
K(n) = P(n) φ(n)
θ̂(n) = θ̂(n - 1) + K(n) ε(n)
When λ = 1 this is simply the RLS algorithm.
The smaller the value of λ, the quicker the information in previous data will be forgotten.
Using the RLS algorithm with a forgetting factor, we may speak about “real time identification”. When the properties
of the process may change (slowly) with time, the algorithm is able to track the time-varying parameters describing such
process.

Summary of the RLS algorithm:

1. Initialization: Set n_a, n_b, λ, θ̂(0) and P(0). Steps 2-5 are repeated starting from t = 1.
2. At time step t = n, measure current output y(n).
3. Recall past y’s and u’s and form φ(n).
4. Apply RLS algorithm for θ̂(n) and P(n)
5. θ̂(n) → θ̂(n − 1) and P(n) → P(n − 1)
6. t = n + 1, go to step 2.

Lecture 6
Example 6-1:
Consider a system described by

y(t) + a1 y(t − 1) + a2 y(t − 2) = b1 u(t − 1) + e(t)

where {e(t)} is a white noise sequence with variance 1. The input u(t) is a uniformly distributed random signal in [-2,2].
The system parameters are
a1 = −0.8, a2 = 0, b1 = 1.0
for the first 500 data samples and then these change to

a1 = −1.1, a2 = 0.3, b1 = 2.0

for the next 500 data samples.
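A sketch of how this experiment might be coded, combining the earlier data-generation style with the forgetting-factor update (λ = 0.99 and the initialisation values are illustrative choices):

N = 1000; u = 2-4*rand(1,N); e = randn(1,N);
a1 = [-0.8*ones(1,500) -1.1*ones(1,500)];      % parameters change at t = 500
a2 = [ 0.0*ones(1,500)  0.3*ones(1,500)];
b1 = [ 1.0*ones(1,500)  2.0*ones(1,500)];
y = zeros(1,N);
for t = 3:N
  y(t) = -a1(t)*y(t-1) - a2(t)*y(t-2) + b1(t)*u(t-1) + e(t);
end
lambda = 0.99; theta = zeros(3,1); P = 1e6*eye(3); Theta = zeros(3,N);
for n = 3:N
  phi   = [-y(n-1); -y(n-2); u(n-1)];
  eps_n = y(n) - phi'*theta;
  P     = (P - (P*(phi*phi')*P)/(lambda + phi'*P*phi))/lambda;
  K     = P*phi;
  theta = theta + K*eps_n;
  Theta(:,n) = theta;                           % store estimates for plotting
end
plot(Theta'); xlabel('t'); ylabel('\theta');    % compare with Figures 9 and 10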

(Figure: output y(t) of the system in Example 6-1, t = 1, ..., 1000.)

Bias, consistency and model approximation

An estimate θ̂ is biased if its expected value deviates from the true value:

E(θ̂) ≠ θ

where E denotes the expected value. The difference E(θ̂) - θ is called the bias.
(The expected value of a variable is the average value that the variable would have if averaged over an extremely long period of time (N → ∞), E ≈ lim_{N→∞} (1/N) \sum_{t=1}^{N}.)

Example 6-2:
Suppose a system (System 2 in Lecture 3) is given by

y(t) + ay(t − 1) = bu(t − 1) + ce(t − 1) + e(t)

where {e(t)} is a white noise sequence with variance σ 2 , with a = −0.8, b = 1.0, c = −0.8, σ = 1.0
System 2 is simulated generating 1000 data points. The input u(t) is a uniformly distributed random signal in [-2,2].

Figure 9: Evolution of parameter estimates with RLS algorithm (λ = 1).

Matlab code generating data (system 2):


>> u=2-4*rand(1,1000); % zero mean uniform distribution
>> e=randn(1,1000);
>> y(1)=0;
>> for i=2:1000
>> y(i)=0.8*y(i-1) + u(i-1) -0.8*e(i-1)+ e(i);
>> end
>> figure(1); plot(u);xlabel(’t’);ylabel(’u(t)’);
>> figure(2); plot(y);xlabel(’t’);ylabel(’y(t)’);

Suppose that we still use the model


ŷ(t) = −ay(t − 1) + bu(t − 1)
to model the system (see Example 1, Lecture 3).

Table 2: Parameter estimates for System 1 (taken from Example 3-1, Lecture 3).

  Parameter | True value | Estimated value
  a         | -0.8       | -0.8142
  b         | 1.0        | 1.0413

Table 3: Parameter estimates for System 2.

  Parameter | True value | Estimated value
  a         | -0.8       | -0.6122
  b         | 1.0        | 0.9581

Thus for System 2 the estimate â will deviate from the true value. We say that an estimate is consistent if

E(θ̂) = θ

It seems that the parameter estimator for System 1 is consistent, but not that for System 2. The models can be regarded as approximations to the system. The model using the derived LS parameter estimator for System 2 is not good.
Recall that System 1 and System 2 are given by

y(t) − 0.8y(t − 1) = u(t − 1) + v(t)

Figure 10: Evolution of parameter estimates with RLS algorithm (λ = 0.99).

Figure 11: System 2.

with
System 1: v(t) = e(t) as a white noise.
System 2: v(t) = −0.8e(t − 1) + e(t) as a moving average (MA) of a white noise sequence {e(t)}.

Properties of the least squares estimate


Assume that the data obey the linear regression

y(t) = ŷ(t) + e(t) = φT (t)θ + e(t)

or in matrix form
y = Φθ + e
where e(t) is a stochastic variable with zero mean and variance σ 2 .
Lemma 1 : Assume further that e(t) is a white noise sequence and consider the estimate

θ̂ = [ΦT Φ]−1 ΦT y.

The following properties hold :


(i) θ̂ is an unbiased estimate of θ.
(ii) The covariance matrix of θ̂ is given by

cov(θ̂) = σ 2 [ΦT Φ]−1 .

Proof: (i)

θ̂ = [Φ^T Φ]^{-1} Φ^T y
  = [Φ^T Φ]^{-1} Φ^T (Φθ + e)
  = θ + [Φ^T Φ]^{-1} Φ^T e

E(θ̂) = θ + E([Φ^T Φ]^{-1} Φ^T e)
      = θ + [Φ^T Φ]^{-1} Φ^T E(e) = θ

(ii)

E[(θ̂ - θ)(θ̂ - θ)^T]
  = E( [Φ^T Φ]^{-1} Φ^T e ([Φ^T Φ]^{-1} Φ^T e)^T )
  = E( [Φ^T Φ]^{-1} Φ^T e e^T Φ [Φ^T Φ]^{-1} )
  = [Φ^T Φ]^{-1} Φ^T E(e e^T) Φ [Φ^T Φ]^{-1}

Note that the white noise assumption implies that E(e e^T) = σ² I, so

cov(θ̂) = [Φ^T Φ]^{-1} Φ^T σ² I Φ [Φ^T Φ]^{-1}
        = σ² [Φ^T Φ]^{-1}

Assuming that Φ^T Φ / N is a finite full rank matrix, we have

cov(θ̂) = (σ²/N) [Φ^T Φ / N]^{-1} → 0  as N → ∞
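A small Monte Carlo sketch of these two properties, using the static model of Example 2-1 with assumed true parameters and Gaussian noise (all numbers are illustrative):

u = [1 2 4 5 7]'; Phi = [u ones(size(u))];
theta_true = [1.2; 0.3]; sigma = 0.5;
est = zeros(2,2000);
for k = 1:2000
  y = Phi*theta_true + sigma*randn(5,1);      % data obeying the linear regression
  est(:,k) = (Phi'*Phi)\(Phi'*y);             % least squares estimate
end
mean(est,2)                                   % close to theta_true (unbiased)
cov(est')                                     % close to the theoretical covariance
sigma^2*inv(Phi'*Phi)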

Input signals
The input signal used in a system identification experiment can have a significant influence on the resulting parameter
estimates. Some conditions on the input sequence must be introduced to yield a reasonable identification results. As a
trivial illustration, we may think of an input that is identically zero. Such an input will not be able to yield full information
about the input/output relationship of the system.
The input should excite the typical modes of the system; this is called persistent excitation. In mathematical terms this is related to the full rank condition of Φ^T Φ / N.
1. Step function: A step function is given by

u(t) = 0 for t < 0,   u(t) = u_0 for t ≥ 0

The user has to choose the amplitude u_0. For systems with a large signal-to-noise ratio, a step input can give information about the dynamics. Rise time, overshoot and static gain are directly related to the step response. Also the major time constants and a possible resonance can be at least crudely estimated from the step response. (ones.m)
2. Uniformly distributed pseudo-random numbers: rand.m generating a sequence drawn from a uniform distribution
on the unit interval.
3. ARMA – Autoregressive moving average process:

C(q −1 )u(t) = D(q −1 )e(t)

where

C(q^{-1}) = 1 + c_1 q^{-1} + ... + c_{nc} q^{-nc}
D(q^{-1}) = 1 + d_1 q^{-1} + ... + d_{nd} q^{-nd}

are set by the user. Different choices of the filter parameters c_i and d_i lead to input signals with various frequency contents, thus emphasizing the identification in different frequency ranges. (filter.m; see the sketch after this list)
4. Sum of sinusoids:

u(t) = \sum_{j=1}^{m} a_j sin(ω_j t + ϕ_j)

where the angular frequencies ω_j are distinct, 0 ≤ ω_1 < ω_2 < ... < ω_m ≤ π. The user has to choose a_j, ω_j and ϕ_j.
5. The input can be (partly) a feedback signal from the output (closed loop identification using a known feedback)
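A sketch of how input signals 2-4 above might be generated in Matlab (filter coefficients, amplitudes and frequencies are illustrative choices):

N = 1000; t = 1:N;
u1 = rand(1,N);                               % 2. uniform pseudo-random numbers on [0,1]
C = [1 -0.7]; D = [1 0.5];                    % 3. ARMA input: C(q^-1)u = D(q^-1)e
u2 = filter(D, C, randn(1,N));                %    filter white noise e through D/C
u3 = 2*sin(pi*t/50 + pi/3) + sin(pi*t/30);    % 4. sum of sinusoids (as used in Example 7-1)
subplot(311); plot(u1); ylabel('u_1');
subplot(312); plot(u2); ylabel('u_2');
subplot(313); plot(u3); ylabel('u_3'); xlabel('t');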

Lecture 7: Pseudo linear regression for ARMAX model
Consider the ARMAX model
A(q −1 )y(t) = B(q −1 )u(t) + C(q −1 )e(t)
or

y(t) = −a1 y(t − 1) − ... − ana y(t − na )


+b1 u(t − 1) + ... + bnb u(t − nb )
+c1 e(t − 1) + ... + cnc e(t − nc ) + e(t)

To simplify the notation, we study a first order system:

y(t) = -a y(t - 1) + b u(t - 1) + c e(t - 1) + e(t)    (19)

If we rewrite (19) as

y(t) = φ^T(t)θ + v(t)

in which v(t) = c e(t - 1) + e(t) is a moving average of e(t),


and
φ = [−y(t − 1), u(t − 1)]T , θ = [a, b]T
In matrix form
y = Φθ + v
The least squares estimate is

θ̂ = [Φ^T Φ]^{-1} Φ^T y
  = [Φ^T Φ]^{-1} Φ^T (Φθ + v)
  = θ + [Φ^T Φ]^{-1} Φ^T v

E(θ̂) = θ + E([Φ^T Φ]^{-1} Φ^T v)

The second term E([Φ^T Φ]^{-1} Φ^T v) ≠ 0 is the bias. This is because v(t) is a correlated signal (coloured noise).
However, (19) can be rewritten as

y(t) = φT (t)θ + e(t)

with
φ = [−y(t − 1), u(t − 1), e(t − 1)]T
θ = [a, b, c]T
Here in φ, y(t − 1) and u(t − 1) are known at time t, but e(t − 1) is unknown. However we can replace it using the
prediction error ε(t − 1). ε(0) can be initialized as 0.
For this system, the RLS algorithm is given by

φ(n) = [−y(n − 1), u(n − 1), ε(n − 1)]T


ε(n) = y(n) − φT (n)θ̂(n − 1)
1 P(n − 1)φ(n)φT (n)P(n − 1)
P(n) = {P(n − 1) − }
λ λ + φT (n)P(n − 1)φ(n)
K(n) = P(n)φ(n)
θ̂(n) = θ̂(n − 1) + K(n)ε(n)

Summary of the RLS algorithm for ARMAX:

1. Initialization: Set n_a, n_b, n_c, λ, θ̂(0) and P(0), with ε(0), ..., ε(1 - n_c) = 0. Steps 2-5 are repeated starting from t = 1.
2. At time step t = n, measure current output y(n).
3. Recall past y's, u's and ε's and form φ(n).
4. Apply the RLS algorithm for ε(n), θ̂(n) and P(n).
5. θ̂(n) → θ̂(n - 1), P(n) → P(n - 1) and ε(n) → ε(n - 1).
6. t = n + 1, go to step 2.
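A sketch of this pseudo linear regression (extended least squares) loop for the first order system (19), assuming data vectors y and u of length N already exist; λ and the initial values are illustrative:

lambda = 1; theta = zeros(3,1); P = 1e6*eye(3);
epsv = zeros(1,N);                      % stored prediction errors; epsv(1) plays the role of eps(0)=0
for n = 2:N
  phi   = [-y(n-1); u(n-1); epsv(n-1)]; % e(n-1) replaced by the prediction error
  eps_n = y(n) - phi'*theta;
  P     = (P - (P*(phi*phi')*P)/(lambda + phi'*P*phi))/lambda;
  K     = P*phi;
  theta = theta + K*eps_n;              % theta = [a; b; c] estimate
  epsv(n) = eps_n;
end
theta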

Figure 12: Comparison of autocorrelation coefficients.

Example 7-1:
Consider a system described by
y(t) + ay(t − 1) = bu(t − 1) + ce(t − 1) + e(t)
where {e(t)} is a white noise sequence with variance 1. The system parameters are

a = −0.9, b = 1.0, c = 0.3

The input is

u(t) = 2 sin(πt/50 + π/3) + sin(πt/30)

and 1000 data samples are generated.

Determining the model structure


For a given data set, consider using the linear regression

y(t) = φT (t)θ + e(t)

with a sequence of model structures of increasing dimension to obtain the best fitted models within each of the model
structures.
e.g. The first model is (na = 1, nb = 1)

y(t) = −a1 y(t − 1) + b1 u(t − 1) + e(t)

and the second model is (na = 2, nb = 2)

y(t) = −a1 y(t − 1) − a2 y(t − 2)


+b1 u(t − 1) + b2 u(t − 2) + e(t)

With more free parameters in the second model, a better fit will be obtained. In determining the best model structure, the
important thing is to investigate whether or not the improvement in the fit is significant.

Figure 13: System input and output (dataset 7-3).

The model fit is given by the loss function V(θ̂) = (1/N) \sum_{t=1}^{N} ε²(t) for each model. Depending on the data set, we may obtain either of the situations shown in Figure 15.
If a model is more complex than is necessary, the model may be over-parameterized.

Example 7-2.
(Over-parameterization) Suppose that the true signal y(t) is described by the ARMA process

y(t) + ay(t − 1) = e(t) + ce(t − 1)

Let the model set be given by


y(t) + a1 y(t − 1) + a2 y(t − 2)
= e(t) + c1 e(t − 1) + c2 e(t − 2)
we see that all a_i, c_i such that

1 + a_1 q^{-1} + a_2 q^{-2} = (1 + a q^{-1})(1 + d q^{-1})
1 + c_1 q^{-1} + c_2 q^{-2} = (1 + c q^{-1})(1 + d q^{-1})

with an arbitrary number d give the same description of the system. The a_i, c_i cannot be uniquely determined, and we may say that the model is over-parameterized. Consequently Φ^T Φ / N is singular.

Figure 14: Results of RLS algorithm (λ = 1).

Figure 15: Left: Model 2 is preferable as Model 1 is not large enough to cover the true system. Right: Model 1 is preferable as Model 2 is more complex, but not significantly better than Model 1.

Lecture 8: The instrumental variable method
The instrumental variable method is a modification of the least squares method designed to overcome the convergence
problems when the disturbance v(t) is not white noise.
For linear regression model

y(t) = φT (t)θ + v(t)

or in vector form

y = Φθ + v

The least squares solution


θ̂ = [ΦT Φ]−1 ΦT y
with
E(θ̂) = θ + E([ΦT Φ]−1 ΦT v)
The term E([ΦT Φ]−1 ΦT v) is zero if
E(ΦT v) = 0
This is true in the following typical cases:
(i) v(t) is white noise.
(ii) φ(t) contains only input terms u(t), which are uncorrelated with v(t), such that E(Φ^T v) = 0. (If v(t) is not white noise, and φ(t) contains y(t - 1), ..., y(t - n_a), then y(t - 1) contains v(t - 1), which in turn is correlated with v(t); then E(Φ^T v) ≠ 0 and θ̂ is biased.)
The instrumental variable method is given by

θ̂ = [Z^T Φ]^{-1} Z^T y
   = [ \sum_{t=1}^{N} z(t) φ^T(t) ]^{-1} \sum_{t=1}^{N} z(t) y(t)

where z(t) is a vector with the same dimension as φ(t).


We see that θ̂ tends to θ as N tends to infinity under the following three conditions.
(i) z(t) and v(t) are uncorrelated.
(ii) ZT Φ is invertible.
(iii) v(t) has zero mean.
The vector z(t) is referred to as the instrumental variable (IV). We now show that instrumental variable estimate is
unbiased:
θ̂ = [ZT Φ]−1 ZT y
= [ZT Φ]−1 ZT (Φθ + v)
= θ + [ZT Φ]−1 ZT v
E(θ̂) = θ + E([ZT Φ]−1 ZT v)
= θ + [ZT Φ]−1 E(ZT v) = θ
The recursive IV algorithm is

ε(n) = y(n) − φT (n)θ̂(n − 1)


P(n − 1)z(n)φT (n)P(n − 1)
P(n) = P(n − 1) −
1 + φT (n)P(n − 1)z(n)
K(n) = P(n)z(n)
θ̂(n) = θ̂(n − 1) + K(n)ε(n)

Choice of the instrumental variable z(t)


Loosely speaking, the instrumental variables z(t) should be sufficiently correlated with φ(t) but uncorrelated with the system noise v(t).
Suppose that the following model is used

y(t) = −a1 y(t − 1) − ... − ana y(t − na )


+b1 u(t − 1) + ... + bnb u(t − nb ) + v(t)
= φT (t)θ + v(t)

with
φ(t) = [−y(t − 1), ... − y(t − na ),
u(t − 1), ..., u(t − nb )]T
θ = [a1 , ..., ana , b1 , ..., bnb ]T
A common choice of z(t) is
z(t) = [−yM (t − 1), ... − yM (t − na ),
u(t − 1), ..., u(t − nb )]T
where y_M(t - 1) is the noise-free output of a model driven by the actual input u(t), using â_i and b̂_i, the current estimates obtained in the recursive IV algorithm.
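A sketch of the recursive IV update with this choice of instruments, assuming data vectors y and u of length N already exist; y_M is generated from the current parameter estimates and the initial values are illustrative:

theta = zeros(2,1); P = 1e6*eye(2); yM = zeros(1,N);
for n = 2:N
  phi   = [-y(n-1);  u(n-1)];                 % regressor built from the measured output
  z     = [-yM(n-1); u(n-1)];                 % instrument built from the noise free model output
  eps_n = y(n) - phi'*theta;
  P     = P - (P*(z*phi')*P)/(1 + phi'*P*z);  % recursive IV covariance update
  K     = P*z;
  theta = theta + K*eps_n;
  yM(n) = -theta(1)*yM(n-1) + theta(2)*u(n-1);% model output using current a and b estimates
end
theta                                         % estimate of [a; b]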

Example 8-1:
Consider a system described by
y(t) + ay(t − 1) = bu(t − 1) + ce(t − 1) + e(t)
where {e(t)} is a white noise sequence with variance 1. The system parameters are

a = −0.9, b = 1.0, c = 0.3

The input is

u(t) = 0.1 sin(πt/50 + π/3) + 0.2 sin(πt/30)

and 1000 data samples are generated.
Choose

φ(t) = [-y(t - 1), u(t - 1)]^T
z(t) = [-ŷ(t - 1), u(t - 1)]^T

Figure 16: System input and output (Example 8-1).

Figure 17: Results of the recursive instrumental variable algorithm.

Maximum likelihood estimation


There is a further important issue to mention, namely the relation of the LS estimator to the maximum likelihood method in estimation theory.
Suppose that the process noise e(t) is a discrete random process, and we know its probability density function as a function of the parameter vector θ, such that we know p(e; θ).
Now we have N data samples, given just as before {y(t), u(t)}; how do we estimate θ?
The idea of maximum likelihood estimation is to maximize a likelihood function, which is often defined as the joint probability of e(t).
Suppose e(t) is uncorrelated; the likelihood function L can then be written as (the joint probability of e(t))

L(θ) = \prod_{t=1}^{N} p(e(t), θ)

This means that the likelihood function is the product of each data sample's pdf.
Consider using the log likelihood function log L. The log function is monotonic, which means that when L is maximum, so is log L.
Instead of looking for the θ̂ that maximizes L, we now look for the θ̂ that maximizes log L; the result will be the same, but the computation is simpler.

∂ log L(θ)/∂θ |_{θ=θ̂} = 0
If e(t) is Gaussian distributed with zero mean and variance σ²,

p(e(t), θ) = (1/\sqrt{2πσ²}) exp{ -e²(t)/(2σ²) }
           = (1/\sqrt{2πσ²}) exp{ -[y(t) - θ^T φ(t)]²/(2σ²) }

log L(θ) = log \prod_{t=1}^{N} p(e(t), θ)
         = \sum_{t=1}^{N} log p(e(t), θ)
         = - \sum_{t=1}^{N} [y(t) - θ^T φ(t)]²/(2σ²) - (N/2) log(2πσ²)
         = - [y - Φθ]^T [y - Φθ]/(2σ²) + c

By setting

∂ log L(θ)/∂θ |_{θ=θ̂} = 0

we obtain

∂[y - Φθ]^T [y - Φθ]/∂θ = 0

∂[y - Φθ]^T [y - Φθ]/∂θ = -2[∂(y^T Φθ)/∂θ]^T + ∂(θ^T Φ^T Φθ)/∂θ
                        = -2Φ^T y + 2Φ^T Φθ
                        = -2Φ^T (y - Φθ) = 0

yielding

θ̂ = [Φ^T Φ]^{-1} Φ^T y

which is simply the LS estimate.
The conclusion is that under the assumption that the noise is Gaussian distributed, the LS estimate is equivalent to the maximum likelihood estimate.
References
[1] Lennart Ljung et al. Theory and Practice of Recursive Identification. The MIT Press, 1983.
[2] Lennart Ljung. System identification. Wiley Online Library, 1999. doi: 10.1002/047134608X.W1046.pub2.
[3] Lennart Ljung. System Identification: Theory for the User. Second edition. Upper Saddle River, N.J.: Prentice Hall, 1999. (UoR call 003.1.LJU).
[4] Torsten Söderström and Petre Stoica. System Identification. Prentice-Hall, Inc., 1988.
[5] Cleve B. Moler. Numerical Computing with MATLAB. SIAM, 2004. doi: 10.1137/1.9780898717952. url: https://uk.mathworks.com/moler/index_ncm.html
[6] Cleve Moler. Professor SVD. http://www.mathworks.com/tagteam/35906_91425v00_clevescorner.pdf, 2006.

Appendix: Matrix Algebra
Matrix quick reference guide/databook etc
Matrix and vector definitions
A matrix is a two dimensional array that contains expressions or numbers. In general matrices will be notated as bold
uppercase letters, sometimes with a trailing subscript to indicate its size.
For example:

A_{4×3} = A = [a_11 a_12 a_13; a_21 a_22 a_23; a_31 a_32 a_33; a_41 a_42 a_43]

is a matrix with 4 rows and 3 columns containing 12 entries (the location of each entry is given by its subscript).
A vector is a collection of symbols or numbers, and can be represented as a matrix with a single column. Vectors are
sometimes notated as a bold lowercase symbol, or with a short line or arrow that may be below or above the symbol.
For example

b = [b_1; b_2; b_3; b_4] = [b_1, b_2, b_3, b_4]^T

is a vector with 4 elements. Note that ^T denotes transpose.

Matrix operations
1. Addition:
C = A + B such that c_ij = a_ij + b_ij (A and B must be the same size)

2. Transpose:
The transpose of a matrix interchanges rows and columns.
C = A^T such that c_ij = a_ji

3. Identity matrix:
The identity matrix I is square and has 1s on the major diagonal and 0s elsewhere.

I = [1 0 ... 0 0; 0 1 ... 0 0; ... ; 0 0 ... 1 0; 0 0 ... 0 1]

4. Multiplication:
The definition of matrix multiplication is C = AB such that c_ij = \sum_k a_ik b_kj
(The number of columns of A must be the same as the number of rows of B.)
Matrix multiplication distributes over addition, but the order of the factors matters, so that

A + AB = A(I + B)  ≠  A + BA = (I + B)A

Matrices are not commutative in multiplication, i.e.

AB ≠ BA

Example: let

A = [a11 a12 a13; a21 a22 a23; a31 a32 a33; a41 a42 a43]   and   B = [b11 b12; b21 b22; b31 b32]

Then C = AB = A_{4×3} B_{3×2}

C = [c11 c12; c21 c22; c31 c32; c41 c42]
  = [a11 b11 + a12 b21 + a13 b31   a11 b12 + a12 b22 + a13 b32;
     a21 b11 + a22 b21 + a23 b31   a21 b12 + a22 b22 + a23 b32;
     a31 b11 + a32 b21 + a33 b31   a31 b12 + a32 b22 + a33 b32;
     a41 b11 + a42 b21 + a43 b31   a41 b12 + a42 b22 + a43 b32]

Subscript notation
To work out whether two matrices will multiply, it is useful to subscript the matrix variable with row and column information. For example, if we write A_{2×3} to indicate that A has 2 rows and 3 columns, then

A_{2×3} B_{3×5} = C_{2×5}

is realised by cancelling the inner 3's, resulting in a 2 by 5 matrix.

5. Multiplication of partitioned matrices
Any matrix multiplication can also be carried out on sub-matrices and vectors, as long as each sub-matrix/vector product is a valid matrix calculation. (You can colour in the individual entries in the example above to demonstrate.)
For example if A = [D  f] and B = [G; h^T] then

AB = DG + f h^T

6. Matrix inverse:
If a matrix is square and not singular it has an inverse such that

A A^{-1} = A^{-1} A = I

where I is the identity matrix.
A^{-1} exists only if A is square and not singular. A is singular if the determinant |A| = 0.
A simple example is

A = [a b; c d],   |A| = ad - bc

A^{-1} = (1/|A|) [d -b; -c a] = (1/(ad - bc)) [d -b; -c a]

Computation of the inverse can be done by computing the matrix determinant |A|, the matrix of minors, the co-factor matrix, and the adjoint/adjugate. Thus if X is the adjoint/adjugate of A,

X A = A X = I |A|

Computing the determinant and the adjoint requires computing a matrix of minors and hence the co-factor matrix. The adjoint/adjugate is then simply the transpose of the co-factor matrix.

Note on numerical computation

Some matrices do not have independent rows and so do not have an inverse. This can be tested by computing the matrix rank.
Some matrices are ill-conditioned, so that the inverse is difficult to compute. A measure called the condition number, defined as the ratio of the maximum singular value to the minimum singular value, indicates whether the matrix inverse will be sensitive with respect to the accuracy of the computer.
In Matlab the condition number can be calculated as

>> max(svd(A))/min(svd(A))

or via the cond command.

Note on the Cayley-Hamilton Theorem

The Cayley-Hamilton theorem says that 'every square matrix satisfies its own characteristic equation.'
The characteristic equation is the solution to

|sI - A| = 0

This gives a way of calculating the inverse of A, by substituting A into its characteristic equation and multiplying by A^{-1}.

Algebra rules:
A + B = B + A            (addition is commutative)
AB ≠ BA                  (multiplication is not commutative)
A(BC) = (AB)C            (associative)
A(B + C) = AB + AC       (distributive)
AI = IA = A
AA^{-1} = A^{-1}A = I    (for a non-singular, square matrix)
(AB)^T = B^T A^T
(AB)^{-1} = B^{-1} A^{-1}

Matrix decomposition
Matrix types
C = C^T such that c_ij = c_ji (if C is symmetric it must be square)
A matrix is symmetric if B = B^T
A matrix is skew symmetric if B = -B^T
A matrix is orthogonal if B^{-1} = B^T
A matrix is positive definite if x^T B x is positive for all non-zero values of the vector x
The scalar (Ax)^T(Ax) will always be zero or positive. Thus if the matrix B can be factored such that B = A^T A (with A having full column rank) it will be positive definite.

Orthogonal matrices
If a vector is transformed by an orthogonal matrix then Euclidean metrics are conserved. That is, if L(x) = x^T x and y = Ax, then evidently L(y) = y^T y = (Ax)^T Ax = x^T A^T A x.
If the measure is conserved then L(x) = L(y), so A^T A = I and hence orthogonality requires A^T = A^{-1}.

Eigenvalues and Eigenvectors

|λI - A| = 0

• If A is real then complex Eigenvalues, if they occur, will come as complex conjugate pairs.
• A symmetric matrix has only real Eigenvalues and its Eigenvectors are orthogonal.
Given the Eigenvalues, the Eigenvectors are such that

A V = V diag(λ_i)

where the columns of V are the Eigenvectors and diag(λ_i) is a diagonal matrix of the Eigenvalues.
If A is symmetric and [V, D] are the Eigenvectors and diagonalised Eigenvalues such that AV = VD, then since V^{-1} = V^T, A^{-1} = V D^{-1} V^T, and since D is diagonal the inverse is now trivial.

Rank vs Eigenvalues and Eigenvectors

Singular value decomposition
For a matrix A we can factor A such that A = U D V^T where U^T U = I, V^T V = I and D is diagonal. The diagonal elements of D are known as the singular values σ:

σ_i = \sqrt{ λ_i(A^T A) }

where λ_i(A^T A) are the Eigenvalues of A^T A.

See [6] for a good overview.
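A quick check of this relationship in Matlab (the 3-by-2 matrix is an arbitrary example):

A = [1 2; 3 4; 5 6];
s = svd(A)                        % singular values of A
sqrt(sort(eig(A'*A),'descend'))   % square roots of the eigenvalues of A'A, same values
cond(A)                           % condition number = max(s)/min(s)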

Transfer functions
Interesting identity
A(I + A)−1 = (I + A)−1 A

Things to do with 2x2 matrices
Let

A = [a b; c d]

determinant: |A| = ad - bc
inverse:

A^{-1} = (1/|A|) [d -b; -c a] = [ d/(ad-bc)  -b/(ad-bc); -c/(ad-bc)  a/(ad-bc) ]

Eigenvalues:

λ = (a + d)/2 ± (1/2) \sqrt{ a² - 2ad + d² + 4bc }

Useful Matlab functions
A*B A+B A\B A/B   matrix multiplication, addition and division (left and right)
eye(n)       create an identity matrix of size n
svd(A)       compute the singular value decomposition matrices of A
eig(A)       compute the Eigenvalues and Eigenvectors of A
rank(A)      compute rank (number of independent rows or columns in A)
cond(A)      compute condition number (an indication of numerical error in the inverse)
det(A)       compute determinant
inv(A)       compute inverse
pinv(A)      compute pseudo inverse = (A^T A)^{-1} A^T for full column rank A
adjoint(A)   compute adjoint/adjugate
eigshow      demonstration of Eigenvalues (you can drag the x vector)

William Harwin Oct 2017


Acronyms
ARIMA Autoregressive Integrated Moving Average
ARMAX Autoregressive Moving Average eXogeneous
ARX Autoregressive eXogeneous
CARMA Controlled ARMA
FIR Finite Impulse Response
IIR Infinite Impulse Response
RLS Recursive Least Squares
Programs

Least squares example

%% System ID example 1
u=[1 2 4 5 7]';
y=[1 3 5 7 8]';
Phi=[u ones(size(u))];
theta=Phi\y
tu=[0 8]';
% figure(1);plot(tu,[tu ones(2,1)]*theta,u,y,'o')
% grid;xlabel('u(t)');ylabel('y(t)')
% text(u,0.5*ones(1,5),{'u(1)','u(2)','u(3)','u(4)','u(5)'});
%%
Phi=[u.^2 u ones(size(u))];
theta2=Phi\y
tu2=(0:.1:8)';
figure(1);plot(tu,[tu ones(2,1)]*theta,tu2,[tu2.^2 tu2 ones(size(tu2))]*theta2,u,y,'o')
grid;xlabel('u(t)');ylabel('y(t)')
text(u+.05,-1.6*ones(1,5),{'u(1)','u(2)','u(3)','u(4)','u(5)'})
text(.05*ones(1,5),y+.05,-1.6*ones(1,5),{'y(1)','y(2)','y(3)','y(4)','y(5)'})
legend({'Example 1','Example 2'},'Location','best')
%

Positive definite example

%% Examples of positive definate matrices
if exist('k','var')==0;k=8;end
if k >7;
k=menu('Choose quadratic form','random','rand pd','pos def','saddle','identity','set to m');
end
if k==1;M=randn(2);end
% random pd
if k==2;M=randn(2);M=M*M';end
% Illustrative pos def
if k==3;M=[.32 -.11;-.11 .52];end
% if k==4;M=[1.08 0.229;2.37 -.267];end
% illustrative saddle
if k==4;M=[-.072 1.373;.28 .18];end
if k==5;M=eye(2);end
if k==6;M=m;end
if k==7;M=1;end
z=cplxgrid(20);
X=real(z);
Y=imag(z);
[n,m]=size(X);
f=zeros(n,m);
for i=1:n
for j=1:m;
x=[X(i,j);Y(i,j)];
f(i,j)=x'*M*x;
end
end
surfc(X,Y,f);
%meshc(X,Y,f);
if k>2; clear; end

Basic ARMAX example

%% Basic examples of ARMAX
% approx page 7 of notes
% System 1
u=2-4*rand(1,1000); % zero mean uniform distribution
e=randn(1,1000);
y(1)=0;y2(1)=0;
for i=2:1000;
y(i)=0.8*y(i-1) + u(i-1) + e(i); % system 1
y2(i)=0.8*y2(i-1) + u(i-1) - 0.8*e(i-1) + e(i); % system 2
end;
sr=100:200;% subrange of the data
%figure(1); plot(sr,u(sr));xlabel('t');ylabel('u(t)');
%figure(2); plot(sr,y(sr));xlabel('t');ylabel('y(t)');
figure(1);subplot(211); plot(sr,u(sr));xlabel('t');ylabel('u(t)');
subplot(212);plot(sr,y(sr));xlabel('t');ylabel('y(t)');

% Matlab code for least squares estimator:
% phi2=[[0 y(1:end-1)]' [0 0 y(1:end-2)]' [0 u(1:end-1)]'];
for i=2: 1000;
phi(i,:)=[y(i-1) u(i-1)];
end;
theta=(phi'*phi)\phi'*y';

% Model output
yhat=phi*theta;
figure(3);
plot(sr,y(sr),'-',sr,yhat(sr),'-.');
legend({'System output','Model output'},'Location','Best')
xlabel('t');
