SI2018
William Harwin
based on notes by X. Hong and W.S. Harwin
Lecture 1: Introduction
Books:
• Ljung, L and T. Söderström (1983): Theory and Practice of Recursive Identification, The MIT Press[1].
• Ljung, Lennart (1987): System identification: Theory for the user, Prentice Hall[2][3].
• Söderström, T and P. Stoica (1989): System Identification, Prentice Hall.[4]
• Åström, K. J. and B. Wittenmark (1989): Adaptive Control, Addison Wesley.
System Identification:
Assessment is via examination and an assignment (see Blackboard).
There may be additional information on the websites http://www.personal.rdg.ac.uk/~sis01xh/ and
http://www.reading.ac.uk/~shshawin/LN
Introduction to System Identification
Systems and models
Systems are objects whose behaviour we would like to study, control and affect. Typical tasks related to the study and use of systems include, e.g.
• A ship: We would like to control its direction by manipulating the rudder angle.
• A telephone channel: We would like to construct a receiver with good reproduction of the transmitted signal.
• A time series of data (e.g. sales): We would like to predict the future values.
A model is a description of the properties of a system. It is necessary to have a model of the system in order to solve
problems such as control, signal processing and system design. Sometimes the aim of modelling is to aid in design; sometimes
a simple model is useful for explaining certain phenomena observed in reality.
A model may be given in any one of the following forms:
1. Mental, intuitive or verbal models: Knowledge of the system’s behavior is summarized in a person’s mind.
2. Graphs and tables: e.g. the step response, Bode plot.
3. Mathematical models: Here we confine this class of model to differential and difference equations.
There are two ways of constructing mathematical models:
1. Physical modelling: Basic laws from physics such as Newton’s laws and balance equations are used to describe the
dynamical behavior of a process. Yet in many cases the processes are so complex that it is not possible to obtain
reasonable models using only physical insight, so we are forced to use the alternative:
2. System identification: Some experiments are performed on the system; A model is then fitted to the recorded data
by assigning suitable values to its parameters.
System identification is the field of modeling dynamic systems from experimental data. A dynamical system can be
conceptually described by the following Figure 1.
Figure 1: A dynamical system with input u(t), output y(t) and disturbance v(t), where t denotes time.
The system is controlled by input u(t). The user cannot control v(t). The output y(t) can be measured and gives
information about the system. For a dynamical system the control action at time t will influence the output at time
instants s > t.
3. Determine/choose model structure.
4. Estimate model parameters.
5. Model validation.
In practice the procedure is iterated until an acceptable model is found. A good model should encompass essential
information without becoming too complex.
Shift-operator calculus
The forward-shift operator
qf (k) = f (k + 1) or qfk = fk+1 (1)
The backward-shift operator
q −1 f (k) = f (k − 1) or q −1 fk = fk−1 (2)
The shift operator is used to simplify the manipulation of higher-order difference equations. Consider the equation
  y(k) + a1 y(k − 1) + ... + ana y(k − na) = b0 u(k − d) + b1 u(k − d − 1) + ... + bnb u(k − d − nb)      (3)
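Using the shift operator, equation (3) can be written compactly (a small illustration consistent with the definitions above) as
  A(q⁻¹)y(k) = q⁻ᵈ B(q⁻¹)u(k)
with A(q⁻¹) = 1 + a1 q⁻¹ + ... + ana q⁻ⁿᵃ and B(q⁻¹) = b0 + b1 q⁻¹ + ... + bnb q⁻ⁿᵇ.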
Z-transform
Given a signal x(t) sampled every T seconds, the z-transform is
  X(z) = Z[x(kT)] = Σ_{k=0}^{∞} x(kT) z⁻ᵏ
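For example (a standard illustration, not part of the original notes), for the unit step x(kT) = 1, k ≥ 0,
  X(z) = Σ_{k=0}^{∞} z⁻ᵏ = 1/(1 − z⁻¹),  |z| > 1.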
f (x) = xT Mx
[Figure: surface plots of the quadratic form f(x) = xᵀMx over x1, x2 ∈ [−1, 1] for different choices of the matrix M.]
Linear regression
[Figure: the measured outputs y(1), ..., y(5) together with two candidate straight-line fits, labelled Example 1 and Example 2.]
Example 2-1:
y(t) = au(t) + b + e(t)
θ = [a, b]T , φ(t) = [u(t), 1]T .
t 1 2 3 4 5
u(t) 1 2 4 5 7
y(t) 1 3 5 7 8
        ŷ(1)       1 1
        ŷ(2)       2 1
  ŷ  =  ŷ(3)   =   4 1   θ  =  φθ
        ŷ(4)       5 1
        ŷ(5)       7 1
Example 2-2:
y(t) = b2 u2 (t) + b1 u(t) + b0 + e(t)
θ = [b2 , b1 , b0 ]T , φ(t) = [u2 (t), u(t), 1]T .
        ŷ(1)         u²(1)     u(1)     1
        ŷ(2)         u²(2)     u(2)     1
  ŷ  =   ..      =     ..       ..      ..   θ  =  φθ
        ŷ(N−1)       u²(N−1)   u(N−1)   1
        ŷ(N)         u²(N)     u(N)     1
Example 2-3:
y(t) = b2 u(t) + b1 u(t − 1) + b0 u(t − 2) + e(t)
θ = [b2 , b1 , b0 ]T , φ(t) = [u(t), u(t − 1), u(t − 2)]T .
        ŷ(1)         u(1)     0        0
        ŷ(2)         u(2)     u(1)     0
  ŷ  =   ..      =    ..       ..      ..      θ  =  φθ
        ŷ(N−1)       u(N−1)   u(N−2)   u(N−3)
        ŷ(N)         u(N)     u(N−1)   u(N−2)
Derivation of the least squares algorithm
For t = 1, ..., N ,
e(t) = y(t) − ŷ(t)
In vector form
  e = [e(1), e(2), ..., e(N − 1), e(N)]ᵀ = y − ŷ
and the sum of squared errors is
  SSE = Σ_{t=1}^{N} e²(t) = eᵀe
The model parameter vector θ is derived so that the distance between these two vectors is as small as possible, i.e. SSE is minimized. Completing the square,
  SSE = (y − φθ)ᵀ(y − φθ) = yᵀy − yᵀφ[φᵀφ]⁻¹φᵀy + xᵀφᵀφ x
where x = θ − [φᵀφ]⁻¹φᵀy. Note that only the last term contains θ, so the SSE is minimal when the last term is minimised.
SSE therefore has the minimum yᵀy − yᵀφ[φᵀφ]⁻¹φᵀy when
  x = θ − [φᵀφ]⁻¹φᵀy = 0
i.e. when
  θ̂ = [φᵀφ]⁻¹φᵀy
Example 2-4:
For Example 2-1, y = [1, 3, 5, 7, 8]ᵀ and

        1 1
        2 1
  φ  =  4 1
        5 1
        7 1

so

  θ̂ = [φᵀφ]⁻¹φᵀy

with

  φᵀφ = [95 19; 19 5],   φᵀy = [118; 24]

giving

  θ̂ = [95 19; 19 5]⁻¹ [118; 24] = [1.1754; 0.3333]
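The same numbers can be checked with a few lines of MATLAB (a sketch; the variable names are illustrative only):

u = [1 2 4 5 7]';
y = [1 3 5 7 8]';
phi = [u ones(5,1)];          % regressor matrix, rows [u(t) 1]
theta = (phi'*phi)\(phi'*y)   % least squares estimate, approx [1.1754; 0.3333]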
An alternative proof:
SSE = eT e = yT y − 2θ T φT y + θ T φT φθ
SSE has its minimum when
  ∂(SSE)/∂θ = 0
Now
  ∂(SSE)/∂θ = −2[∂(yᵀφθ)/∂θ]ᵀ + ∂(θᵀφᵀφθ)/∂θ
            = −2φᵀy + 2φᵀφθ
            = −2φᵀ(y − φθ) = 0
yielding
θ̂ = [φT φ]−1 φT y
Lecture 3: Model representations for dynamical systems
Consider a dynamical system with input signal {u(t)} and output signal {y(t)}. Suppose that these signals are sampled
in discrete time t = 1, 2, 3... and the sample values can be related through a linear difference equation
  y(t) + a1 y(t − 1) + ... + ana y(t − na) = b1 u(t − 1) + b2 u(t − 2) + ... + bnb u(t − nb) + v(t)
where
A(q −1 ) = 1 + a1 q −1 + ... + ana q −na
B(q −1 ) = b1 q −1 + ... + bnb q −nb
The system can be expressed as a linear regression
  y(t) = φᵀ(t)θ + v(t)
with
  θ = [a1 , ..., ana , b1 , ..., bnb ]ᵀ
  φ(t) = [−y(t − 1), ..., −y(t − na ), u(t − 1), ..., u(t − nb )]ᵀ
If {v(t)} is a sequence of independent random variables with zero mean, it is called 'white noise'.
If no input is present and {v(t)} is white, then the system becomes
  A(q⁻¹)y(t) = v(t),
an autoregressive (AR) process.
Basic examples of ARMAX Systems
The basic examples to represent the systems are given by
y(t) + ay(t − 1) = bu(t − 1) + ce(t − 1) + e(t)
where {e(t)} is a white noise sequence with variance σ 2 .
Suppose that two sets of parameters are used:
System 1: a = −0.8, b = 1.0, c = 0.0, σ = 1.0
y(t) − 0.8y(t − 1) = u(t − 1) + e(t)
System 2: a = −0.8, b = 1.0, c = −0.8, σ = 1.0
x(t) − 0.8x(t − 1) = u(t − 1)
y(t) = x(t) + e(t)
[Block diagrams: in System 1, u(t) passes through q⁻¹/(1 − 0.8q⁻¹) and the noise e(t) passes through 1/(1 − 0.8q⁻¹) before being added to form y(t); in System 2, u(t) passes through q⁻¹/(1 − 0.8q⁻¹) to give x(t) and e(t) is added directly at the output, y(t) = x(t) + e(t).]
Example 3-1
System 1 is simulated generating 1000 data points. The input u(t) is a uniformly distributed random signal in [-2,2].
Figure 7: Upper: System 1 with a zero mean uniform signal in [−2, 2] as input. Lower: Output of the ARMAX process
described by System 1
Figure 8: The model parameters computed by the least squares algorithm are used to generate the model output and
show good agreement with the system output.
Recursive identification
Suppose that at time instant (n − 1) we have already calculated θ̂, but a new output value y(n) becomes available at time n, so we need to recalculate θ̂.
We would like a way of efficiently recalculating the model each time we have new data. The ideal form would be
  θ̂(n) = θ̂(n − 1) + correction
Thus if the model is correct at time (n − 1) and the new data at time n is indicative of the model then the correction
factor would be close to zero.
Example 3-2
Estimation of a constant. Assume that a model is
y(t) = b
This means a constant is to be determined from a number of noisy measurements y(t), t = 1, ..., n.
θ = b, φ(t) = 1, y = [y(1), ..., y(n)]ᵀ, and
  φ = [1, 1, ..., 1]ᵀ   (an n × 1 vector of ones)
so
  θ̂ = [φᵀφ]⁻¹φᵀy = (1/n) Σ_{t=1}^{n} y(t)
i.e. the least squares estimate of the constant is simply the mean of the measurements.
If we introduce a label (n − 1) to represent the fact that n − 1 data points are used in deriving the mean, such that
  θ̂(n − 1) = (1/(n − 1)) Σ_{t=1}^{n−1} y(t)
In the following we will derive the recursive form of the least squares algorithm by modifying the off-line algorithm.
Collect the regressors up to time n in the matrix Φ(n) = [φ(1), ..., φ(n)]ᵀ and the outputs in the vector y(n) = [y(1), ..., y(n)]ᵀ. Thus the least squares solution is
  θ̂(n) = [Φᵀ(n)Φ(n)]⁻¹Φᵀ(n)y(n)
When a new data point arrives, n is increased by 1 and the least squares solution above requires repeating the whole calculation, including recalculating the inverse (expensive in computer time and storage).
Let us look at the expressions Φᵀ(n)Φ(n) and Φᵀ(n)y(n) and define P⁻¹(n) = Φᵀ(n)Φ(n), so that P⁻¹(n) = P⁻¹(n − 1) + φ(n)φᵀ(n) and Φᵀ(n)y(n) = Φᵀ(n − 1)y(n − 1) + φ(n)y(n). Then
  θ̂(n) = P(n)[P⁻¹(n − 1)θ̂(n − 1) + φ(n)y(n)]
        = P(n)[P⁻¹(n)θ̂(n − 1) − φ(n)φᵀ(n)θ̂(n − 1) + φ(n)y(n)]
        = θ̂(n − 1) + P(n)φ(n)[y(n) − φᵀ(n)θ̂(n − 1)]
        = θ̂(n − 1) + K(n)ε(n)
To obtain the recursive least squares (RLS) algorithm we use the matrix inversion lemma
  [A + BCD]⁻¹ = A⁻¹ − A⁻¹B[C⁻¹ + DA⁻¹B]⁻¹DA⁻¹
which can be verified by multiplying the right-hand side by (A + BCD):
  (A + BCD)(A⁻¹ − A⁻¹B[C⁻¹ + DA⁻¹B]⁻¹DA⁻¹)
  = I + BCDA⁻¹ − B[C⁻¹ + DA⁻¹B]⁻¹DA⁻¹ − BCDA⁻¹B[C⁻¹ + DA⁻¹B]⁻¹DA⁻¹
  = I + BCDA⁻¹ − BC{C⁻¹ + DA⁻¹B}[C⁻¹ + DA⁻¹B]⁻¹DA⁻¹
  = I + BCDA⁻¹ − BCDA⁻¹
  = I
Now look at the derivation of the RLS algorithm so far and consider applying the matrix inversion lemma with
  A = P⁻¹(n − 1), B = φ(n), C = 1, D = φᵀ(n)
so that A + BCD = P⁻¹(n). This gives the RLS algorithm
  ε(n) = y(n) − φᵀ(n)θ̂(n − 1)
  P(n) = P(n − 1) − P(n − 1)φ(n)φᵀ(n)P(n − 1) / (1 + φᵀ(n)P(n − 1)φ(n))
  K(n) = P(n)φ(n)
  θ̂(n) = θ̂(n − 1) + K(n)ε(n)
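A compact MATLAB sketch of a single RLS update is given below (the function and variable names are illustrative and not part of the notes):

function [theta,P] = rls_update(theta,P,phi,y)
% One recursive least squares step for a new observation (phi, y)
err   = y - phi'*theta;                        % prediction error eps(n)
P     = P - (P*(phi*phi')*P)/(1 + phi'*P*phi); % covariance update
K     = P*phi;                                 % gain K(n)
theta = theta + K*err;                         % parameter update
end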
Here the term ε(n) should be interpreted as a prediction error. It is the difference between the measured output y(n) and
the one-step-ahead prediction
ŷ(n|n − 1, θ̂(n − 1)) = φT (n)θ̂(n − 1)
made at time t = (n − 1). If ε(n) is small θ̂(n − 1) is good and should not be modified very much.
K(n) should be interpreted as a weighting factor showing how much the value of ε(n) will modify the different elements
of the parameter vector.
The algorithm also needs initial values of θ̂(0) and P(0). It is convenient to set the initial values of θ̂(0) to zeros and
the initial value of P(0) to LN × I, where I is the identity matrix and LN is a large number.
Example 5-1:
(see Example 3-2) Estimation of a constant.
Assume that a model is
y(t) = b
This means a constant is to be determined from a number of noisy measurements y(t), t = 1, ..., n.
Since φ(n) = 1,
  P(n) = P(n − 1) − P²(n − 1)/(1 + P(n − 1)) = P(n − 1)/(1 + P(n − 1))
i.e.
  P⁻¹(n) = P⁻¹(n − 1) + 1 = n
  K(n) = P(n) = 1/n
Thus θ̂(n) = θ̂(n − 1) + n1 ε(n) coincides with the results of Example 2 in Lecture 3.
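A quick MATLAB illustration of this special case (a sketch with made-up data):

b = 2; n = 200;
y = b + randn(1,n);           % noisy measurements of the constant b
theta = 0;
for t = 1:n
    err   = y(t) - theta;     % prediction error eps(t)
    theta = theta + err/t;    % theta(t) = theta(t-1) + (1/t)*eps(t)
end
% theta now equals mean(y), the off-line least squares estimate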
Example 5-2:
The system input/output of a process is measured as in the following table. Suppose the process can be described by a model y(t) = ay(t − 1) + bu(t − 1) = φᵀ(t)θ. A recursive least squares algorithm is applied for on-line parameter estimation, and at time t = 2,
  θ̂(2) = [0.8, 0.1]ᵀ,   P(2) = [1000 0; 0 1000]
Find θ̂(3) and P(3).

  t      1    2    3    4
  y(t)   0.5  0.6  0.4  0.5
  u(t)   0.3  0.4  0.5  0.7

Note φ(t) = [y(t − 1), u(t − 1)]ᵀ, so φ(3) = [0.6, 0.4]ᵀ and
  ε(3) = y(3) − φᵀ(3)θ̂(2) = 0.4 − (0.6 × 0.8 + 0.4 × 0.1) = −0.12

  P(3) = P(2) − P(2)φ(3)φᵀ(3)P(2) / (1 + φᵀ(3)P(2)φ(3))
       = [1000 0; 0 1000] − [600; 400][600, 400] / (1 + [0.6, 0.4][600; 400])
       = [1000 0; 0 1000] − (1/521)[360000 240000; 240000 160000]
       = [309.0211 −460.6526; −460.6526 692.8983]

  K(3) = P(3)φ(3) = [309.0211 −460.6526; −460.6526 692.8983][0.6; 0.4] = [1.1516; 0.7678]

  θ̂(3) = θ̂(2) + K(3)ε(3) = [0.8; 0.1] + (−0.12)[1.1516; 0.7678] = [0.6618; 0.0079]
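These numbers can be reproduced with a few lines of MATLAB (a sketch; the variable names are illustrative):

theta2 = [0.8; 0.1];  P2 = 1000*eye(2);
phi3 = [0.6; 0.4];    y3 = 0.4;                      % phi(3) = [y(2); u(2)], y(3)
eps3 = y3 - phi3'*theta2;                            % prediction error, -0.12
P3 = P2 - (P2*(phi3*phi3')*P2)/(1 + phi3'*P2*phi3);  % covariance update
K3 = P3*phi3;                                        % gain
theta3 = theta2 + K3*eps3                            % updated parameter estimate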
In the least squares method we minimise the loss function V(θ) = Σ_{t=1}^{n} [y(t) − φᵀ(t)θ]² with respect to θ. Consider modifying the loss function to
  V(θ) = Σ_{t=1}^{n} λ⁽ⁿ⁻ᵗ⁾ [y(t) − φᵀ(t)θ]²
where λ < 1 is the forgetting factor. e.g. λ = 0.99 or 0.95. This means that as n increases, the measurements obtained
previously are discounted. Older data has less effect on the coefficient estimation, hence “forgotten”.
The RLS algorithm with a forgetting factor is
  ε(n) = y(n) − φᵀ(n)θ̂(n − 1)
  P(n) = (1/λ){P(n − 1) − P(n − 1)φ(n)φᵀ(n)P(n − 1) / (λ + φᵀ(n)P(n − 1)φ(n))}
  K(n) = P(n)φ(n)
  θ̂(n) = θ̂(n − 1) + K(n)ε(n)
When λ = 1 this is simply the RLS algorithm.
The smaller the value of λ, the quicker the information in previous data will be forgotten.
Using the RLS algorithm with a forgetting factor, we may speak about “real time identification”. When the properties
of the process may change (slowly) with time, the algorithm is able to track the time-varying parameters describing such
process.
1. Initialization: Set na , nb , λ, θ̂(0) and P(0). Steps 2–5 are repeated starting from t = 1 (a MATLAB sketch of the whole loop is given after this list).
2. At time step t = n, measure current output y(n).
3. Recall past y’s and u’s and form φ(n).
4. Apply RLS algorithm for θ̂(n) and P(n)
5. θ̂(n) → θ̂(n − 1) and P(n) → P(n − 1)
6. t = n + 1, go to step 2.
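Putting these steps together for a first-order model, the on-line loop might look like the following MATLAB sketch (illustrative only; it assumes data vectors y and u already exist):

lambda = 0.98;                      % forgetting factor
theta = [0; 0];  P = 1e6*eye(2);    % initial values: theta(0) zeros, P(0) = LN*I
for n = 2:length(y)
    phi = [-y(n-1); u(n-1)];                                   % regressor phi(n)
    err = y(n) - phi'*theta;                                   % prediction error
    P   = (P - (P*(phi*phi')*P)/(lambda + phi'*P*phi))/lambda; % P(n) with forgetting
    K   = P*phi;                                               % gain K(n)
    theta = theta + K*err;                                     % theta(n)
end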
Lecture 6
Example 6-1:
Consider a system described by
  y(t) + a1 y(t − 1) + a2 y(t − 2) = b1 u(t − 1) + e(t)
where {e(t)} is a white noise sequence with variance 1. The input u(t) is a uniformly distributed random signal in [−2, 2].
The system parameters are
a1 = −0.8, a2 = 0, b1 = 1.0
for the first 500 data samples and then these change to
[Figure: output y of the system over t = 1, ..., 1000.]
An estimate θ̂ is biased if its expected value deviates from the true value:
  E(θ̂) ≠ θ
where E denotes the expected value. The difference E(θ̂) − θ is called the bias.
(The expected value of a variable is the average value that the variable would take if averaged over an extremely long period of time (N → ∞), E ≈ lim_{N→∞} (1/N) Σ_{t=1}^{N}.)
Example 6-2:
Suppose a system (System 2 in Lecture 3) is given by
  y(t) + ay(t − 1) = bu(t − 1) + ce(t − 1) + e(t)
where {e(t)} is a white noise sequence with variance σ², with a = −0.8, b = 1.0, c = −0.8, σ = 1.0.
System 2 is simulated generating 1000 data points. The input u(t) is a uniformly distributed random signal in [-2,2].
[Figure: parameter estimates θ̂ over t = 1, ..., 1000.]
Table 2: Parameter estimates for System 1 (taken from Example 1, Lecture 3).
Thus for System 2 the estimate â will deviate from the true value. We say that an estimate is consistent if it converges to the true value as the number of data points grows, so that asymptotically
  E(θ̂) = θ
It seems that the parameter estimator for System 1 is consistent, but not that for System 2. The models can be regarded as approximations to the system; the model using the derived LS parameter estimator for System 2 is not good.
[Figure: parameter estimates θ̂ over t = 1, ..., 1000.]
[Figure 11: block diagram of System 2, in which u(t) passes through q⁻¹/(1 − 0.8q⁻¹) to give x(t) and e(t) is added at the output to give y(t).]
Recall that System 1 and System 2 are given by
  y(t) − 0.8y(t − 1) = u(t − 1) + v(t)
with
System 1: v(t) = e(t) as a white noise.
System 2: v(t) = −0.8e(t − 1) + e(t) as a moving average (MA) of a white noise sequence {e(t)}.
Consider the linear regression y(t) = φᵀ(t)θ + e(t), or in matrix form
  y = Φθ + e
where e(t) is a stochastic variable with zero mean and variance σ².
Lemma 1: Assume further that e(t) is a white noise sequence and consider the estimate
  θ̂ = [ΦᵀΦ]⁻¹Φᵀy.
Then (i) the estimate is unbiased, E(θ̂) = θ, and (ii) its covariance matrix is E[(θ̂ − θ)(θ̂ − θ)ᵀ] = σ²[ΦᵀΦ]⁻¹.
Proof : (i)
θ̂ = [ΦT Φ]−1 ΦT y
= [ΦT Φ]−1 ΦT (Φθ + e)
= θ + [ΦT Φ]−1 ΦT e
E(θ̂) = θ + E([ΦT Φ]−1 ΦT e)
= θ + [ΦT Φ]−1 ΦT E(e) = θ
(ii)
  E[(θ̂ − θ)(θ̂ − θ)ᵀ] = E([ΦᵀΦ]⁻¹Φᵀe eᵀΦ[ΦᵀΦ]⁻¹) = [ΦᵀΦ]⁻¹Φᵀ E(eeᵀ) Φ[ΦᵀΦ]⁻¹ = σ²[ΦᵀΦ]⁻¹
Assuming that ΦᵀΦ/N tends to a finite full rank matrix, we have
  cov(θ̂) = (σ²/N)[ΦᵀΦ/N]⁻¹ → 0 as N → ∞
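The lemma can be illustrated with a small Monte-Carlo MATLAB sketch (a hypothetical static model, not taken from the notes):

N = 100; runs = 500; thetas = zeros(2,runs);
u = randn(N,1); Phi = [u ones(N,1)];      % fixed regressors
theta_true = [0.5; 1];
for r = 1:runs
    y = Phi*theta_true + randn(N,1);      % white noise with sigma = 1
    thetas(:,r) = (Phi'*Phi)\(Phi'*y);    % LS estimate for this realisation
end
mean(thetas,2)    % close to theta_true (unbiased)
cov(thetas')      % close to inv(Phi'*Phi), i.e. sigma^2*[Phi'*Phi]^-1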
Input signals
The input signal used in a system identification experiment can have a significant influence on the resulting parameter estimates. Some conditions on the input sequence must be introduced to yield reasonable identification results. As a trivial illustration, we may think of an input that is identically zero. Such an input will not be able to yield full information about the input/output relationship of the system.
The input should excite the typical modes of the system; this is called persistent excitation. In mathematical terms it is related to the full rank condition on ΦᵀΦ/N.
1. Step function: A step function is given by
  u(t) = 0 for t < 0,   u(t) = u0 for t ≥ 0
The user has to choose the amplitude u0 . For systems with a large signal-to-noise ratio, a step input can give information about the dynamics. Rise time, overshoot and static gain are directly related to the step response. Also the major time constants and a possible resonance can be at least crudely estimated from the step response. (ones.m)
2. Uniformly distributed pseudo-random numbers: rand.m generates a sequence drawn from a uniform distribution on the unit interval.
3. ARMA – autoregressive moving average process: the input is obtained by filtering a white noise sequence {w(t)}, for example D(q⁻¹)u(t) = C(q⁻¹)w(t), where
  C(q⁻¹) = 1 + c1 q⁻¹ + ... + cnc q⁻ⁿᶜ
  D(q⁻¹) = 1 + d1 q⁻¹ + ... + dnd q⁻ⁿᵈ
are set by the user. Different choices of the filter parameters ci and di lead to input signals with various frequency contents, thus emphasizing the identification in different frequency ranges. (filter.m)
4. Sum of sinusoids:
  u(t) = Σ_{j=1}^{m} aj sin(ωj t + ϕj )
where the angular frequencies ωj are distinct, 0 ≤ ω1 < ω2 < ... < ωm ≤ π. The user has to choose aj , ωj and ϕj .
5. The input can be (partly) a feedback signal from the output (closed loop identification using a known feedback)
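As a rough MATLAB illustration of input types 1–4 (a sketch; the parameter values are arbitrary):

N  = 1000;  t = (1:N)';
u1 = ones(N,1);                              % 1. unit step (amplitude u0 = 1)
u2 = rand(N,1);                              % 2. uniform random numbers on [0,1]
u3 = filter([1 0.5],[1 -0.9],randn(N,1));    % 3. ARMA-filtered white noise
u4 = 2*sin(pi*t/50 + pi/3) + sin(pi*t/30);   % 4. sum of two sinusoids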
Lecture 7: Pseudo linear regression for ARMAX model
Consider the ARMAX model
A(q −1 )y(t) = B(q −1 )u(t) + C(q −1 )e(t)
or, written out for the first-order case,
  y(t) + ay(t − 1) = bu(t − 1) + ce(t − 1) + e(t)      (1)
If we rewrite (1) as
  y(t) = φᵀ(t)θ + e(t)
with
  φ(t) = [−y(t − 1), u(t − 1), e(t − 1)]ᵀ
  θ = [a, b, c]ᵀ
Here in φ, y(t − 1) and u(t − 1) are known at time t, but e(t − 1) is unknown. However we can replace it using the
prediction error ε(t − 1). ε(0) can be initialized as 0.
For this system, the RLS algorithm is given by
1. Initialization: Set na , nb , nc , λ, θ̂(0) and P(0). ε(0), ..., ε(1 − nc ) = 0. Steps 2–5 are repeated starting from t = 1 (a MATLAB sketch of the loop is given after this list).
2. At time step t = n, measure current output y(n).
3. Recall past y’s and u’s and ε’s form φ(n).
4. Apply RLS algorithm for ε(n), θ̂(n) and P(n)
5. θ̂(n) → θ̂(n − 1) and P(n) → P(n − 1) and ε(n) → ε(n − 1).
6. t = n + 1, go to step 2.
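A MATLAB sketch of this pseudo-linear regression (recursive extended least squares) loop for the first-order model above (the variable names are illustrative, and y and u are assumed to exist):

theta = [0; 0; 0];  P = 1e6*eye(3);  epsOld = 0;   % theta = [a; b; c]
for n = 2:length(y)
    phi = [-y(n-1); u(n-1); epsOld];               % e(n-1) replaced by eps(n-1)
    err = y(n) - phi'*theta;                       % prediction error eps(n)
    P   = P - (P*(phi*phi')*P)/(1 + phi'*P*phi);
    theta = theta + P*phi*err;                     % K(n) = P(n)*phi(n)
    epsOld = err;                                  % store eps(n) for next step
end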
[Figure: a white noise sequence e(t), a moving-average filtered version of it, and their amplitude distributions.]
Example 7-1:
Consider a system described by
y(t) + ay(t − 1) = bu(t − 1) + ce(t − 1) + e(t)
where {e(t)} is a white noise sequence with variance 1. The system parameters are
The input is
  u(t) = 2 sin(πt/50 + π/3) + sin(πt/30)
1000 data samples are generated.
The data are then fitted with a sequence of model structures of increasing dimension to obtain the best fitted models within each of the model structures.
e.g. The first model is (na = 1, nb = 1)
  y(t) + a1 y(t − 1) = b1 u(t − 1) + e(t)
With more free parameters in the second model, a better fit will be obtained. In determining the best model structure, the
important thing is to investigate whether or not the improvement in the fit is significant.
[Figure: the input u and the output y of the simulated system over t = 1, ..., 1000.]
The model fit is given by the loss function V(θ̂) = (1/N) Σ_{t=1}^{N} ε²(t) for each model. Depending on the data set, we may obtain either of the following Figures:
If a model is more complex than is necessary, the model may be over-parameterized.
Example 7-2.
(Over-parameterization) Suppose that the true signal y(t) is described by the ARMA process
[Figure: prediction errors and parameter estimates θ̂ over t = 1, ..., 1000.]
Figure 15: Left: Model 2 is preferable as Model 1 is not large enough to cover the true system. Right: Model 1 is
preferable as Model 2 is more complex, but not significantly better than Model 1.
Lecture 8: The instrumental variable method
The instrumental variable method is a modification of the least squares method designed to overcome the convergence
problems when the disturbance v(t) is not white noise.
For the linear regression model
  y(t) = φᵀ(t)θ + v(t)
or in vector form
  y = Φθ + v
the instrumental variable (IV) estimate is
  θ̂ = [ZᵀΦ]⁻¹Zᵀy = [Σ_{t=1}^{N} z(t)φᵀ(t)]⁻¹ Σ_{t=1}^{N} z(t)y(t)
where the rows of Z are the instrument vectors zᵀ(t), chosen to be correlated with the regressors φ(t) but uncorrelated with the disturbance v(t).
with
φ(t) = [−y(t − 1), ... − y(t − na ),
u(t − 1), ..., u(t − nb )]T
θ = [a1 , ..., ana , b1 , ..., bnb ]T
A common choice of z(t) is
z(t) = [−yM (t − 1), ... − yM (t − na ),
u(t − 1), ..., u(t − nb )]T
where yM (t − 1) is the noise-free output of a model driven by the actual input u(t), using âi and b̂i , the current estimates obtained in the recursive IV algorithm.
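A batch version of this idea might look like the following MATLAB sketch (illustrative; a0 and b0 are preliminary least squares estimates used only to build the instruments, and y, u are assumed to be column vectors):

yM = zeros(size(y));
for t = 2:length(y)
    yM(t) = -a0*yM(t-1) + b0*u(t-1);     % noise-free model output
end
Phi = [-y(1:end-1)  u(1:end-1)];         % regressors  [-y(t-1)  u(t-1)]
Z   = [-yM(1:end-1) u(1:end-1)];         % instruments [-yM(t-1) u(t-1)]
theta_iv = (Z'*Phi)\(Z'*y(2:end))        % instrumental variable estimate [a; b]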
Example 8-1:
Consider a system described by
y(t) + ay(t − 1) = bu(t − 1) + ce(t − 1) + e(t)
where {e(t)} is a white noise sequence with variance 1. The system parameters are
The input is
  u(t) = 0.1 sin(πt/50 + π/3) + 0.2 sin(πt/30)
1000 data samples are generated.
Choose
φ(t) = [−y(t − 1), u(t − 1)]T
z(t) = [−ŷ(t − 1), u(t − 1)]T
[Figure: the input u and the output y used in Example 8-1 over t = 1, ..., 1000.]
[Figure: prediction error and parameter estimates θ̂ for the instrumental variable method over t = 1, ..., 1000.]
This means that the likelihood function is the product of the pdfs of the individual data samples.
Consider using the log likelihood function log L. The log function is monotonic, so when L is at its maximum, so is log L.
Instead of looking for the θ̂ that maximizes L, we can look for the θ̂ that maximizes log L; the result will be the same, but the computation is simpler.
  ∂ log L(θ)/∂θ |θ=θ̂ = 0
If e(t) is Gaussian distributed with zero mean and variance σ²,
  p(e(t), θ) = (1/√(2πσ²)) exp{−e²(t)/(2σ²)} = (1/√(2πσ²)) exp{−[y(t) − θᵀφ(t)]²/(2σ²)}
  log L(θ) = log Π_{t=1}^{N} p(e(t), θ) = Σ_{t=1}^{N} log p(e(t), θ)
           = −Σ_{t=1}^{N} [y(t) − θᵀφ(t)]²/(2σ²) − (N/2) log(2πσ²)
           = −[y − Φθ]ᵀ[y − Φθ]/(2σ²) + c
By setting
  ∂ log L(θ)/∂θ |θ=θ̂ = 0
we obtain
  ∂[y − Φθ]ᵀ[y − Φθ]/∂θ = −2[∂(yᵀΦθ)/∂θ]ᵀ + ∂(θᵀΦᵀΦθ)/∂θ
                        = −2Φᵀy + 2ΦᵀΦθ
                        = −2Φᵀ(y − Φθ) = 0
yielding
θ̂ = [ΦT Φ]−1 ΦT y
which is simply the LS estimates.
The conclusion is that, under the assumption that the noise is Gaussian distributed, the LS estimate is equivalent to the maximum likelihood estimate.
References
[1] Lennart Ljung et al. Theory and Practice of Recursive Identification. The MIT Press, 1983 (cit. on p. 1).
[2] Lennart Ljung. System identification. Wiley Online Library, 1999. doi: 10.1002/047134608X.W1046.pub2 (cit. on p. 1).
[3] Lennart Ljung. System Identification: Theory for the User. (UoR call 003.1.LJU). Second Edition, Upper Saddle River, N.J.: Prentice Hall, 1999 (cit. on p. 1).
[4] Torsten Söderström and Petre Stoica. System Identification. Prentice-Hall, Inc., 1988 (cit. on p. 1).
[5] Cleve B. Moler. Numerical Computing with MATLAB. SIAM, 2004. doi: 10.1137/1.9780898717952. url: https://uk.mathworks.com/moler/index_ncm.html.
[6] Cleve Moler. Professor SVD. http://www.mathworks.com/tagteam/35906_91425v00_clevescorner.pdf. 2006 (cit. on p. 30).
Appendix: Matrix Algebra
Matrix quick reference guide/databook etc
Matrix and vector definitions
A matrix is a two dimensional array that contains expressions or numbers. In general matrices will be notated as bold
uppercase letters, sometimes with a trailing subscript to indicate its size.
For example:
          a11 a12 a13
          a21 a22 a23
  A4×3 =  a31 a32 a33
          a41 a42 a43
is a 4 × 3 matrix, and b = [b1 , b2 , b3 , b4 ]ᵀ is a vector with 4 elements. Note that ᵀ denotes transpose.
Matrix operations
1. Addition:
C = A + B such that cij = aij + bij (A and B must be the same size)
2. Transpose:
The transpose of a matrix interchanges rows and columns.
C = AT such that cij = aji
3. Identity matrix:
The identity matrix I is square and has 1s on the major diagonal, elsewhere 0s.
       1 0 ··· 0 0
       0 1 ··· 0 0
  I =  ⋮ ⋮     ⋮ ⋮
       0 0 ··· 1 0
       0 0 ··· 0 1
4. Multiplication:
The definition of matrix multiplication is C = AB such that cij = Σk aik bkj .
(The number of columns of A must be the same as the number of rows of B.)
Matrix multiplication distributes over addition, but the order of the factors matters:
  A + AB = A(I + B) ≠ A + BA = (I + B)A
Matrices are not commutative in multiplication, i.e.
  AB ≠ BA
Example:
let
       a11 a12 a13
       a21 a22 a23              b11 b12
  A =  a31 a32 a33   and   B =  b21 b22
       a41 a42 a43              b31 b32
Then C = AB = A4×3 B3×2 is the 4 × 2 matrix with entries
  c11 = a11 b11 + a12 b21 + a13 b31    c12 = a11 b12 + a12 b22 + a13 b32
  c21 = a21 b11 + a22 b21 + a23 b31    c22 = a21 b12 + a22 b22 + a23 b32
  c31 = a31 b11 + a32 b21 + a33 b31    c32 = a31 b12 + a32 b22 + a33 b32
  c41 = a41 b11 + a42 b21 + a43 b31    c42 = a41 b12 + a42 b22 + a43 b32
Subscript notation
To work out whether two matrices will multiply it is useful to subscript the matrix variable with row and column information; conformability can then be checked from the dimensions of each matrix. For example if we write A2×3 to indicate that it has 2 rows and 3 columns, then
  A2×3 B3×5 = C2×5
is realised by cancelling the inner 3s, resulting in a 2 by 5 matrix.
5. Multiplication of partitioned matrices
Any matrix multiplication can also be carried out on sub-matrices and vectors as long as each sub-matrix/vector product is a valid matrix calculation. (You can colour in the individual sub-matrices in the equation above to demonstrate.)
For example if A = [D  f ] and B = [G ; hᵀ] (i.e. G stacked above the row vector hᵀ), then
  AB = DG + f hᵀ
6. Matrix inverse:
If a matrix is square and not singular it has an inverse such that
AA−1 = A−1 A = I
For a 2 × 2 matrix A = [a b ; c d] the determinant is
  |A| = ad − bc
and the inverse is
  A⁻¹ = (1/|A|) [d −b ; −c a] = (1/(ad − bc)) [d −b ; −c a]
Computation of the inverse can be done by computing the determinant |A|, the matrix of minors, the co-factor matrix and the adjoint/adjugate. Thus if X is the adjoint/adjugate of A,
  XA = AX = |A| I
Computing the determinant and the adjoint requires computing a matrix of minors and hence the co-factor matrix; the adjoint/adjugate is then simply the transpose of the co-factor matrix.
The condition number of A, the ratio of its largest to smallest singular value, can be computed in MATLAB as
>> max(svd(A))/min(svd(A))
Algebra rules:
A + B = B + A                 (addition is commutative)
AB ≠ BA                       (multiplication is not commutative)
A(BC) = (AB)C                 (associative)
A(B + C) = AB + AC            (distributive)
AI = IA = A
AA⁻¹ = A⁻¹A = I               (for a non-singular square matrix)
(AB)ᵀ = BᵀAᵀ
(AB)⁻¹ = B⁻¹A⁻¹
Matrix decomposition
Matrix types
C = CT such that cij = cji (If C is symmetric it must be square)
A matrix is symmetric if B = B T
A matrix is skew symmetric if B = −B T
A matrix is orthogonal if B −1 = B T
A matrix is positive definite if xT Bx is positive for all values of the vector x
The scalar (Ax)ᵀ(Ax) will always be zero or positive. Thus if the matrix B can be factored such that B = AᵀA it will be positive (semi-)definite.
Orthogonal matrices
If a vector is transformed by an orthogonal matrix then Euclidean metrics are conserved. That is if L(x) = xT x and
y = Ax then evidently L(y) = y T y = (Ax)T Ax = xT AT Ax
If the measures are conserved then L(x) = L(y) so AT A = I and hence orthogonality requires AT = A−1
Eigenvalues and Eigenvectors
The Eigenvalues λ of a square matrix A are the solutions of |λI − A| = 0.
• If A is real then complex Eigenvalues, if they occur, will be as complex conjugate pairs.
• A symmetric matrix has only real Eigenvalues and the Eigenvectors are orthogonal.
Given the Eigenvalues, the Eigenvectors are such that
  AV = V diag(λi )
where the columns of V are the Eigenvectors and diag(λi ) is a diagonal matrix of the Eigenvalues.
If A is symmetric and [V, D] are the Eigenvectors and diagonalised Eigenvalues such that AV = V D, then since V⁻¹ = Vᵀ we have A = V DVᵀ and A⁻¹ = V D⁻¹Vᵀ, and since D is diagonal the inverse is now trivial.
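A one-line MATLAB check of this property (a sketch with a random symmetric matrix):

A = randn(4); A = A + A';      % random symmetric matrix
[V,D] = eig(A);
norm(inv(A) - V*inv(D)*V')     % close to zero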
Singular value decomposition
For a matrix A we can factor A such that A = U DVᵀ where UᵀU = I, VᵀV = I and D is diagonal. The diagonal elements of D are known as the singular values σ, where
  σi = √(λi (AᵀA))
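In MATLAB this can be checked directly (a small sketch):

A = randn(5,3);
[U,D,V] = svd(A);
norm(A - U*D*V')               % close to zero
sqrt(eig(A'*A))                % the singular values (up to ordering)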
Transfer functions
Interesting identity
A(I + A)−1 = (I + A)−1 A
Things to do with 2x2 matrices
Let
        a b
  A  =  c d
determinant |A| = ad − bc
inverse
  A⁻¹ = (1/|A|) [d −b ; −c a] = [ d/(ad−bc)  −b/(ad−bc) ; −c/(ad−bc)  a/(ad−bc) ]
Eigenvalues
  λ = (a + d)/2 ± (1/2)√(a² − 2ad + d² + 4bc)
Useful Matlab functions
A*B A+B A\B A/B matrix multiplication, addition and division (left and right)
eye(n) create an identity matrix size n
svd(A) compute the singular value decomposition matrices of A
eig(A) compute the Eigenvalues and Eigenvectors of A
rank(A) compute rank (number of independent rows or columns in A)
cond(A) compute condition number (an indication of numerical error in the inverse)
det(A) compute determinant
inv(A) compute inverse
pinv(A) compute pseudo inverse = (AᵀA)⁻¹Aᵀ
adjoint(A) compute adjoint/adjugate
eigshow demonstration of Eigenvalues, (can drag the x vector)
%% Examples of positive definite matrices
if exist('k','var')==0;k=8;end
if k >7;
k=menu('Choose quadratic form','random','rand pd','pos def','saddle','identity','set to m');
end
if k==1;M=randn(2);end
% random pd
if k==2;M=randn(2);M=M*M';end
% Illustrative pos def
if k==3;M=[.32 -.11;-.11 .52];end
% if k==4;M=[1.08 0.229;2.37 -.267];end
% illustrative saddle
if k==4;M=[-.072 1.373;.28 .18];end

% System 1
u=2-4*rand(1,1000); % zero mean uniform distribution
e=randn(1,1000);
y(1)=0;y2(1)=0;
for i=2:1000;
y(i)=0.8*y(i-1) + u(i-1) + e(i); % system 1
y2(i)=0.8*y2(i-1) + u(i-1) - 0.8*e(i-1) + e(i); % system 2
end;
sr=100:200;% subrange of the data
%figure(1); plot(sr,u(sr));xlabel('t');ylabel('u(t)');
%figure(2); plot(sr,y(sr));xlabel('t');ylabel('y(t)');
figure(1);subplot(211); plot(sr,u(sr));xlabel('t');ylabel('u(t)');
subplot(212);plot(sr,y(sr));xlabel('t');ylabel('y(t)');

% Matlab code for least squares estimator:
% phi2=[[0 y(1:end-1)]' [0 0 y(1:end-2)]' [0 u(1:end-1)]'];
for i=2:1000;
phi(i,:)=[y(i-1) u(i-1)];
end;
theta=(phi'*phi)\phi'*y';

% Model output
yhat=phi*theta;
figure(3);
plot(sr,y(sr),'-',sr,yhat(sr),'-.');
legend({'System output','Model output'},'Location','Best')
xlabel('t');