Digital Signal Processing With Matlab Examples, Volume 3 (2017)
Jose Maria Giron-Sierra
Systems Engineering and Automatic Control
Universidad Complutense de Madrid
Madrid
Spain
MATLAB is a registered trademark of The MathWorks, Inc., and is used with permission. The
MathWorks does not warrant the accuracy of the text or exercises in this book. This book's use or
discussion of MATLAB software or related products does not constitute endorsement or sponsorship by
The MathWorks of a particular pedagogical approach or particular use of the MATLAB software.
Preface
This is the third book of a trilogy. As in the other books, a series of MATLAB
programs are embedded in the chapters for several purposes: to illustrate the
techniques, to provide implementation examples, and to encourage personal
exploration departing from a successful start.
The book has two parts, each having just one chapter. These chapters are long
and have a considerable number of bibliographic references.
When using a GPS in a car, sometimes it is not possible to keep contact with the
satellites, for instance inside tunnels. In this case, a model of the car motion (a
dynamic model) can be used for data substitution. The adequate combination of
measurements and models is the key idea of the Kalman filter, which is the central
topic of the first part of the book. This filter was formulated for linear conditions.
There are modifications for nonlinear conditions, like the extended Kalman filter, or
the unscented Kalman filter. A newer idea is to use particle filters. These topics are
covered in the chapter under an important perspective: Bayesian filtering.
Compressed sensing has emerged as a promising idea. One of the intended
applications is networked devices or sensors, which are becoming a reality. This
topic is considered in the second part of the book. Some experiments that
demonstrate image denoising applications were included.
For easier reading of the book, the longer programs have been put in an
appendix. And a second appendix on optimization has been added to support some
contents of the last chapter.
The reader is invited to discover the profound interconnections and commonalities
that exist behind the variety of topics in this book. This common ground will surely
nourish the future of signal processing.
As said in the preface of the other books, our particular expertise in signal
processing has two main roots: research and teaching. I belong to the Faculty of
Physics, Universidad Complutense de Madrid, Spain. During our experimental
research on autonomous vehicles, maritime drones, satellite control, etc., we
practiced the main methods of digital signal processing, for the use of a variety of
sensors and for the prediction of vehicle motions. For years, I have taught Signal
Processing in a Master in Biomedical Physics and a Master in New Technologies.
The style of the programs included in the book is purposely simple. The reader is
invited to type in the programs, as this helps in catching coding details. In any case,
all programs are available from the book web page:
www.dacya.ucm.es/giron/SPBook3/Programs.
A lot of different materials have been used to erect this book: articles, blogs,
codes, experimentation. I tried to cite with adequate references all the pieces that
have been useful. If someone has been forgotten, please contact me. Most of the
references cited in the text are available from the Internet. We have to express our
gratitude for the public information available in this way.
Please, send feedback and suggestions for further improvement and support.
Acknowledgments
Thanks to my University, my colleagues and my students. Since this and the other
book required a lot of time taken from nights, weekends and holidays, I have to
sincerely express my gratitude to my family.
List of Figures

Figure 1.1 Keeping the car at a distance from the road border
Figure 1.2 Prediction (P), measurement (M) and update (U) PDFs
Figure 1.3 Variation of K as a function of sigy/sigx
Figure 1.4 The algorithm is a cycle
Figure 1.5 A two-tank system example
Figure 1.6 System outputs (measurements)
Figure 1.7 System states, and states estimated by the Kalman filter
Figure 1.8 Error evolution
Figure 1.9 Evolution of the Kalman gains
Figure 1.10 Evolution of the state covariance
Figure 1.11 The prediction step, from left to right
Figure 1.12 The measurement
Figure 1.13 Estimation of the next state
Figure 1.14 Bayes net corresponding to the Kalman filter
Figure 1.15 Satellite position under disturbances
Figure 1.16 Example of nonlinear function: arctan()
Figure 1.17 Original and propagated PDFs
Figure 1.18 Propagation of a PDF through nonlinearity
Figure 1.19 Propagated PDFs for sigma = 0.7, 1, 2
Figure 1.20 Propagation of a shifted PDF through nonlinearity
Figure 1.21 Basic linear approximation using tangent
Figure 1.22 Falling body example
Figure 1.23 System states (cross marks)
Figure 1.24 Distance measurement and drag
Figure 1.25 The three non-zero components of the ∂f/∂x Jacobian
Figure 1.26 Propagation of ellipsoids (state n = 43 to 44)
Figure 1.27 System states (cross marks) under noisy conditions
Figure 1.28 Distance measurement. Drag
Figure 1.29 System states (cross marks), and states estimated by the EKF (continuous)
Figure 2.1 The line L and: (a) the ball B1/2, (b) the ball B1, (c) the ball B2
Figure 2.2 An example of the solution paths obtained with LARS
Figure 2.3 Solution paths using LARS for diabetes set
Figure 2.4 Solution paths using LASSO for diabetes set
Figure 2.5 Soft-thresholding operator
Figure 2.6 Application of ISTA for a sparse signal recovery example
Figure 2.7 Evolution of objective function along ISTA iterations
Figure 2.8 A BP sparsest solution
Figure 2.9 Evolution of objective function ||x||1 along ADMM iterations
Figure 2.10 Sparse solution obtained with OMP
Figure 2.11 Evolution of the norm of the residual
Figure 2.12 The CS scheme
Figure 2.13 A sparse signal being measured
Figure 2.14 Recovered signal
Figure 2.15 Evolution of ||x||1 along iterations
Figure 2.16 A phase transition curve
Figure 2.17 Original image
Figure 2.18 (right) Chan-Vese segmentation, (left) level set
Figure 2.19 A patch dictionary
Figure 2.20 The dictionary problem
Figure 2.21 Original picture, and image with added Gaussian noise
Figure 2.22 Patch dictionary obtained with K-SVD
Figure 2.23 Denoised image
Figure 2.24 A synthetic image
Figure 2.25 Example of figure during MCA process
Figure 2.26 The original composite signal and its components
Figure 2.27 Visualization of banded matrix using spy()
Figure 2.28 Visualization of the Bucky ball matrix structure
Figure 2.29 Visualization of the Bucky ball graph
Figure 2.30 Visualization of HB/nnc1374 matrix using spy()
Figure 2.31 Visualization of heat diffusion example
Figure 2.32 Effect of Gaussian diffusion, original on top
Figure 2.33 Effect of Gaussian anti-diffusion, original on top
Figure 2.34 The diffusion coefficient and the corresponding flux function
Figure 2.35 Denoising of image with salt & pepper noise, using P-M method
Figure 2.36 Example of Bregman distance
Figure 2.37 ROF total variation denoising using split Bregman
Figure 2.38 Evolution of nuclear norm
Figure 2.39 A test of reconstruction quality
1.1 Introduction
Consider the case of satellite tracking. You must determine where your satellite is,
using a large antenna. Signals from the satellite are noisy. Measurements of antenna
angles have some uncertainty margins. But you have something that may help you:
the satellite follows a known orbit, so at time T it must be at position P. However,
this help should be taken with caution, since there are orbit perturbations.
Satellite tracking is an example of a more general scenario. The target is to estimate
the state of a dynamic system, a state that changes over time. The means you
have are measurements and a mathematical model of the system dynamics. These
two means should be combined as well as possible.
Some more examples could be useful to capture the nature of the problem to be
considered in this chapter.
According to a description given at a scientific conference, a research team
was developing a small UAV (unmanned aerial vehicle) to fly over their university
campus. They used distance measurement with an ultrasonic sensor for flight altitude
control. Sometimes the altitude measurements were lost or completely wrong. In
such moments, these measurements were substituted by a reasonable value. At least
a simplistic (or even implicit) model should be used here, for two reasons: to determine
that there is a wrong measurement, and to obtain a reasonable value.
Nowadays a similar problem is found with vehicular GPS. In certain circumstances
of the travel (for instance, a tunnel) the connection with satellites is lost, but the
information to the driver should continue.
Another example is the case of biosignals. When measuring electroencephalograms
(EEG), electrocardiograms (ECG), etc., sometimes artifacts appear due to
bad electrode contact, eye blinking, interferences, etc. These bad measurements
should be correctly identified as outliers, and some correction procedure should be
applied.
This chapter deals with optimal state estimation for dynamic systems. The
Bayesian methodology provides a general framework for this problem. Over time,
a set of important practical methods, which can be seen as Bayesian instances, has
been developed, such as the Kalman filter and the particle filter.
Given a dynamic system, the Bayesian approach to state estimation attempts
to obtain the posterior PDF of the state vector using all the available information.
Part of this information is provided by a model of the system, and part is based
on measurements. The state estimation could be done with a recursive filter, which
repeats a prediction and an update operation.
Denote the posterior PDF at time step k as p(x_k | Y_k), where Y_k is the set of all
previous measurements, Y_k = { y_j , j = 1, 2, ..., k }.
The prediction operation propagates the posterior PDF from time step k-1 to k,
as follows:

p(x_k | Y_{k-1}) = ∫ p(x_k | x_{k-1}) p(x_{k-1} | Y_{k-1}) dx_{k-1}   (1.1)

where the left-hand side A is the prior at k, the factor B = p(x_k | x_{k-1}) is given by
the system model, and the factor C = p(x_{k-1} | Y_{k-1}) is the posterior at k-1.
The update operation takes into account the new measurement y_k at k:

p(x_k | Y_k) = p(y_k | x_k) p(x_k | Y_{k-1}) / p(y_k | Y_{k-1})

In some cases, with linear dynamics and measurement, the system model can be
a Gauss-Markov model:

x(n+1) = A x(n) + B u(n) + w(n)
y(n) = C x(n) + v(n)

where w(n) is the process noise, and v(n) is the observation noise.
The first equation of the system model can be used for the term B, and the second
equation for the likelihood term p(y_k | x_k).
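The prediction integral (1.1) and the Bayesian update can be checked numerically with a simple grid-based (histogram) filter. The sketch below is written in Python/NumPy rather than the book's MATLAB, and the scalar random-walk state, the noise values and the measurement are invented for illustration; it is not code from the book.

```python
import numpy as np

def gauss(x, mu, sigma):
    # Gaussian PDF evaluated on a grid
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# discretized scalar state space
xs = np.linspace(-10.0, 10.0, 401)
dx = xs[1] - xs[0]

# posterior at step k-1: p(x_{k-1} | Y_{k-1}), here N(0, 1)
post = gauss(xs, 0.0, 1.0)

# prediction (1.1): integrate transition density times previous posterior
sigma_w = 0.5                                    # process noise std (random walk)
trans = gauss(xs[:, None], xs[None, :], sigma_w) # trans[i, j] = p(x_k = xs[i] | x_{k-1} = xs[j])
prior = trans @ post * dx                        # p(x_k | Y_{k-1})

# update with a measurement y_k = 2.0, observation noise std 0.8
sigma_v = 0.8
lik = gauss(2.0, xs, sigma_v)                    # likelihood p(y_k | x_k)
new_post = lik * prior
new_post /= np.sum(new_post) * dx                # divide by the evidence p(y_k | Y_{k-1})

mean_post = np.sum(xs * new_post) * dx
```

Because everything here is Gaussian, the result can be checked against the closed form: the prior variance is 1 + 0.5² = 1.25, so the posterior mean should be 2·1.25/(1.25 + 0.8²) ≈ 1.32.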
Notice that, in order to align with the notation commonly employed in the Bayesian
filters literature, we denote vectors in boldface letters (in previous chapters we used
bars over letters).
The Kalman filter [55] offers an optimal solution for state estimation, provided
the system is linear, and noises are Gaussian. It is a particular case of the Bayesian
recursive filter.
In more general cases, with nonlinear dynamics, the system model is based on
the two functions f() and h(). A linearization strategy could be applied in order to
still use the Kalman filter algorithm; this is called the Extended Kalman Filter (EKF).
An alternative is to propagate a few special points (sigma points) through the system
equations; with these points it is possible to approximate the prior and posterior PDFs.
An example of this alternative is the Unscented Kalman Filter (UKF). For nonlinear/
non-Gaussian cases that do not tolerate approximations, the Particle Filter
could be used; this filter is based on the propagation of many points.
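The linearization idea behind the EKF can be illustrated by propagating a Gaussian through the arctan() nonlinearity that appears in the figures of this chapter. The following Python/NumPy sketch (the mean, standard deviation and sample count are arbitrary choices, not values from the book) compares the linearized mean and variance with a Monte Carlo reference obtained by propagating many points:

```python
import numpy as np

rng = np.random.default_rng(0)

mu, sigma = 0.5, 0.2            # Gaussian input PDF N(mu, sigma^2)
f = np.arctan
dfdx = 1.0 / (1.0 + mu ** 2)    # derivative of arctan at the mean

# EKF-style linearized propagation of mean and variance (tangent at the mean)
mu_lin = f(mu)
var_lin = (dfdx ** 2) * sigma ** 2

# Monte Carlo reference: propagate many samples through the nonlinearity
x = rng.normal(mu, sigma, 200_000)
y = f(x)
mu_mc, var_mc = y.mean(), y.var()
```

For a small input variance the two agree closely; increasing sigma makes the propagated PDF visibly non-Gaussian and the linearized moments drift away from the Monte Carlo ones, which is exactly the regime where the UKF or the particle filter is preferred.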
The chapter starts with the standard Kalman filter, after some preliminaries. Then
nonlinearities are considered, and there are sections on EKF, UKF, and particle
filters. All these methods are naturally linked to the use of computers or digital
processors, and so the world of numerical computation must be visited in another
section. Smoothing is another important field that is treated in Sect. 1.10. The last
sections intend to at least introduce important extensions and applications of optimal
state estimation.
Some limits should be admitted for this chapter, and therefore there is no space for
many related topics, like H-infinity filters, game theory, exact filters, etc. The last
section offers some links for the interested reader.
The chapter is mainly based on [8, 15, 19, 40, 72, 96].
In general, many problems related to state estimation remain to be adequately
solved. The field is open for more research.
1.2 Preliminaries
Scalar case:

mean: x̄ = E(x(n))
variance: σ²_x = E((x(n) - x̄)²)

Two processes:

f̂(y) = E(x | y)
This is an example taken from [95]. It is the case of driving a car, keeping a distance
x from the border of the road (Fig. 1.1).
In this example it is useful to take into account that:
N(μ₁, σ₁) · N(μ₂, σ₂) = const · N(μ, σ)   (1.7)

with:

μ σ⁻¹ = μ₁ σ₁⁻¹ + μ₂ σ₂⁻¹
σ⁻¹ = σ₁⁻¹ + σ₂⁻¹   (1.8)
x̂ = (σ²_y0 / (σ²_y0 + σ²_y1)) y(1) + (σ²_y1 / (σ²_y0 + σ²_y1)) y(0)   (1.9)

1/σ²_x = 1/σ²_y0 + 1/σ²_y1   (1.10)

x̂(1|1) = x̂ = y(0) + (σ²_y0 / (σ²_y0 + σ²_y1)) (y(1) - y(0)) = x̂(1|0) + K(1) (y(1) - x̂(1|0))   (1.11)

where:

K(1) = σ²_y0 / (σ²_y0 + σ²_y1)   (1.12)
Equation (1.11) has the form of a Kalman filter; K () is the Kalman gain. The equation
says that the best estimate can be obtained using the previous best estimate and a
correction term. This term compares the latest measurement with the previous best
estimate.
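Equations (1.10)-(1.12) can be exercised with a few lines of code. This is a Python sketch for illustration (the measurement values and variances are invented; the book uses MATLAB):

```python
def fuse(y0, var0, y1, var1):
    """Combine two noisy measurements of the same constant x:
    previous best estimate plus a gain times the correction term,
    as in Eqs. (1.11)-(1.12); fused variance as in Eq. (1.10)."""
    K = var0 / (var0 + var1)                  # Kalman-style gain, Eq. (1.12)
    x_hat = y0 + K * (y1 - y0)                # Eq. (1.11)
    var = 1.0 / (1.0 / var0 + 1.0 / var1)     # Eq. (1.10)
    return x_hat, var, K

# two equally uncertain measurements: the gain is 1/2 and the
# fused variance is half of each individual variance
x_hat, var, K = fuse(1.0, 4.0, 3.0, 4.0)
```

Note that the fused variance is always smaller than either measurement variance, which is the point of combining the two estimates.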
Likewise, the variance could be written as follows:
The term w is a random perturbation with variance σ²_w. The lateral velocity is u, set
equal to zero. After some time T, the best estimate (prediction) would be:
The increase of variance is not good. A new measurement is welcome. Suppose that
at time 2 a new measurement is taken. Again the product of Gaussians appears, as
we combine the prediction x(2|1) and the measurement y(2). Therefore:
where:

K(2) = σ²_x(2|1) / (σ²_x(2|1) + σ²_y2)   (1.19)
(Figure 1.2: PDFs of the prediction (P), the measurement (M), and the update (U).)
The philosophy of the Kalman filter is to obtain better estimates by combining pre-
diction and measurement (the combination of two estimates is better). The practical
procedure is to update the prediction with the measurement.
Figure 1.2 shows what happens with the PDFs of the prediction (P), the measure-
ment (M) and the update (U). The update PDF has the smallest variance, so it is a
better estimation.
(Figure 1.3: variation of K as a function of sigy/sigx.)
Another aspect of the Kalman filter philosophy is that, in order to estimate the
present system state, the value of K() is modulated according to the confidence
offered by measurements or by prediction. Suppose that the prediction variance is
constant; then, if the measurement uncertainty increases, K decreases: it could be said
that the correction exerted by K becomes more prudent.
Figure 1.3 shows how K depends on σ_y/σ_x. The figure has been generated with
Program 1.17.
This subsection focuses on the propagation of mean and covariance. The basis of the
next study is an important lemma, which applies to a partition of a set of Gaussian
variables:
x = ( x₁ ; x₂ )

With:

x̄ = ( x̄₁ ; x̄₂ );   S_x = ( S₁₁ S₁₂ ; S₂₁ S₂₂ )
Taking into account the model, the four covariance components are:

Σ_xx = A Σ(n) Aᵀ + Σ_w   (1.23)
Σ_xy = A Σ(n) Cᵀ + Σ_wv   (1.24)
Σ_yx = C Σ(n) Aᵀ + Σ_wvᵀ   (1.25)
Σ_yy = C Σ(n) Cᵀ + Σ_v   (1.26)
Thus, the mean and the variance of the partitioned process are the following:

μ_p = ( A x̄(n) ; C x̄(n) ) + ( B ; 0 ) u(n)   (1.27)

Σ_p = ( A Σ(n) Aᵀ + Σ_w   A Σ(n) Cᵀ + Σ_wv ; C Σ(n) Aᵀ + Σ_wvᵀ   C Σ(n) Cᵀ + Σ_v )   (1.28)
where:

K(n) = Σ_xy Σ_yy⁻¹ = [A Σ(n) Cᵀ + Σ_wv] [C Σ(n) Cᵀ + Σ_v]⁻¹   (1.31)

The last three equations provide a one-step prediction. The term K(n) is called the
Kalman gain.
And the important point is that, since we know the conditional mean, we have
the minimum variance estimate (MVE) of the system state: this is the Kalman filter.
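As a numeric illustration of the gain formula (1.31), the following Python sketch builds the covariance components (1.23)-(1.26) for a small hypothetical two-state system; all matrix values are invented for the example, and the noises are taken as uncorrelated:

```python
import numpy as np

# hypothetical 2-state system (values chosen only for illustration)
A = np.array([[0.9, 0.1], [0.0, 0.8]])
C = np.array([[1.0, 0.0]])
Sigma = np.eye(2) * 0.5          # current state covariance Sigma(n)
Sv = np.eye(1) * 0.1             # observation noise covariance
Swv = np.zeros((2, 1))           # cross-covariance (zero: uncorrelated noises)

# covariance components of the partitioned Gaussian, Eqs. (1.24) and (1.26)
Sxy = A @ Sigma @ C.T + Swv
Syy = C @ Sigma @ C.T + Sv

# one-step-prediction Kalman gain, Eq. (1.31)
K = Sxy @ np.linalg.inv(Syy)
```

With these numbers the gain works out to K = (0.45/0.6, 0) = (0.75, 0): only the measured first state receives a direct correction.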
This subsection links with subsection (5.2.5), where a simple example of a Wiener
filter was described. In that example, no model of x(n) was considered. A recursive
estimation was obtained, with the following expression:
As was said in the second book, the above expression has the typical form of
recursive estimators, where the estimation is improved as a function of the estimation
error.
In the preceding subsection the Kalman filter was derived by studying the propagation
of means and variances. Notice that x̂(n+1) was obtained using y(n). This
is prediction.
Now, let us establish a second version of the filter, where x̂(n+1) is obtained using
y(n+1). This is filtering. A scalar Gauss-Markov example will be considered. The
derivation of the Kalman filter will be based on the minimization of the estimation
variance. This is a rather long derivation borrowed from [8]. Although long, it is an
interesting deduction exercise that exploits orthogonality relations.
Let us proceed along three steps:
1. Problem statement
The scalar system is the following:

x(n+1) = A x(n) + w(n)
y(n) = C x(n) + v(n)

where A and C are constants (they are not matrices; however, we prefer to use
capital letters).
Assumptions:

x(0) = 0; w(0) = 0

(it will reach the typical form after the coming development)
The estimation variance is minimized by setting to zero its derivative with respect
to the gain:

∂σ²(n+1) / ∂k(n+1) = -2 E{ (x(n+1) - x̂(n+1)) y(n+1) } = 0   (1.38)
Due to E{v(n+1) x̂(n)} = 0, and to the rest of the orthogonality relations, the
previous equation reduces to:
Now, let us find a convenient expression of the variance. From definitions, one
obtains:
Since:
x(n + 1) = f (n + 1) x(n) + k(n + 1) y(n + 1)
Considering that:
E{e(n + 1) y(n + 1)} = 0 and y(n + 1) = C x(n + 1) + v(n + 1)
After squaring, most terms cancel, because error and noises are uncorrelated. The
result is:
k(n+1) = C [A² σ²(n) + σ²_w] / (σ²_v + C² σ²_w + C² A² σ²(n))   (1.55)
To conclude this subsection, let us write a summary. The Kalman filter is given by
the next three equations:
k(n+1) = C [A² σ²(n) + σ²_w] / (σ²_v + C² σ²_w + C² A² σ²(n))   (1.57)
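The scalar filter just summarized can be iterated numerically. The following Python sketch (the values of A, C and the noise variances are arbitrary choices for illustration) runs the variance recursion and checks that the gain settles to the closed form of Eq. (1.55)/(1.57):

```python
# scalar Gauss-Markov system: x(n+1) = A x(n) + w(n), y(n) = C x(n) + v(n)
A, C = 0.95, 1.0
var_w, var_v = 0.1, 0.5

var = 1.0          # initial estimation variance sigma^2(0)
for _ in range(200):
    M = A ** 2 * var + var_w               # predicted variance
    k = C * M / (var_v + C ** 2 * M)       # gain; algebraically Eq. (1.55)
    var = (1.0 - k * C) * M                # updated estimation variance

# closed-form check of Eq. (1.55) with the converged variance
k_formula = C * (A ** 2 * var + var_w) / (var_v + C ** 2 * var_w
                                          + C ** 2 * A ** 2 * var)
```

Since var_v + C²M = var_v + C²var_w + C²A²σ²(n), the gain written with the predicted variance M is exactly the gain of Eq. (1.55); after a few iterations both var and k reach constant values.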
1.3 Kalman Filter

The Kalman filter was introduced in 1960 [55]. Now, there are thousands of related
publications. This fact reflects the eminent importance of the Kalman filter, which is
a fundamental tool for many real-life applications.
An overview of the academic literature shows that the Kalman filter can be derived
in several ways. Likewise, there are several different notations that could be used.
This section establishes the Kalman filter equations, and then introduces a typical
algorithm for its use. Then, there is a series of subsections devoted to important
aspects related to the Kalman filter.
Let us recapitulate what we have seen, and write now a summary of the Kalman
filter equations.
The system (or signal) model is:
In the pair of equations above, the first equation is frequently called the transition
equation (transition from one state to the next one), the second equation corresponds
to measurement.
In order to simplify the expressions, let us introduce the following matrices:
N(n+1) = C Σ(n) Cᵀ + Σ_v = Σ_yy   (1.62)
The most used versions of the Kalman filter are the at-present filter, and the one-step
prediction filter.
Usually the literature does not include Σ_wv, either because the noises are uncorrelated,
or because, if they are correlated, the problem can be rewritten to make
this term disappear.
It is also common in the literature to replace names as follows:
Note that the AKF filter corresponds to the development in subsection (8.2.3), based
on [8]; while the OPKF filter corresponds to subsection (8.2.2), based on [40].
The standard way for the application of the Kalman filter is by repeating a two-step
algorithm. This is represented with the diagram shown in Fig. 1.4.
The equations to be applied in each step are written below. They correspond to
the AKF filter with input. Note slight changes of notation, which are introduced for
easier translation to code.
(Figure 1.4: the algorithm is a cycle: a prediction step yielding x_a and M, and an update step yielding x_e and P, linked by the gain K.)
(a) Prediction

x_a(n+1) = A x_e(n) + B u(n)   (1.71)
M(n+1) = A P(n) Aᵀ + Σ_w

(b) Update

K(n+1) = M(n+1) Cᵀ [C M(n+1) Cᵀ + Σ_v]⁻¹
P(n+1) = M(n+1) - K(n+1) C M(n+1)
x_e(n+1) = x_a(n+1) + K(n+1) (y(n+1) - C x_a(n+1))

(Figure 1.5: the two-tank system example, with levels h1, h2 and resistances R1, R2.)
(Figure 1.6: system outputs y1 and y2, along 40 sampling periods.)

(Figure 1.7: system states x1, x2 (continuous) and estimated states xe1, xe2, along 40 sampling periods.)
Program 1.3 includes an implementation of the AKF Kalman filter. The program
is somewhat long because it has to prepare for the algorithm and reserve space for
matrices and vectors.
Program 1.3 also includes a simulation of the noisy process. Figure 1.6 depicts the
outputs of the system, which are the measurements of the tank heights. A great virtue of
the Kalman filter is that it is able to get good estimates of the states from severely
corrupted measurements.
Figure 1.7 compares the evolution of the 2-variable system, in continuous curves,
and the state estimation yielded by the Kalman filter, depicted by x-marks. Since perhaps
the initial system state is not known, the program considers different initial states for
the system and for the Kalman filter. The initial values of the covariance matrices are set
rym(:,nn)=ym;
%
%Prediction
xa=(A*xe)+(B*u); %a priori state
M(:,:,nn+1)=(A*P(:,:,nn)*A')+ Sw;
%Update
K(:,:,nn+1)=(M(:,:,nn+1)*C')*inv((C*M(:,:,nn+1)*C')+Sv);
P(:,:,nn+1)=M(:,:,nn+1)-(K(:,:,nn+1)*C*M(:,:,nn+1));
%estimated (a posteriori) state:
xe=xa+(K(:,:,nn+1)*(ym-(C*xa)));
end;
%------------------------------------------
% display of system outputs
figure(1)
plot([0 Nf],[0 0],'g'); hold on; %horizontal axis
plot([0 0],[-0.2 1.2],'k'); %vertical axis
plot(rym(1,:),'r'); %plots y1
plot(rym(2,:),'b'); %plots y2
xlabel('sampling periods');
title('system outputs');
% display of state evolution
figure(2)
plot([0 Nf],[0 0],'g'); hold on; %horizontal axis
plot([0 0],[-0.2 1.2],'k'); %vertical axis
plot(x1,'r'); %plots x1
plot(x2,'b'); %plots x2
plot(xe1,'mx'); %plots xe1
plot(xe2,'kx'); %plots xe2
xlabel('sampling periods');
title('system and Kalman filter states');
(Figure 1.8: evolution of the estimation errors er1 and er2, along 40 sampling periods.)
If the error dynamics is stable, the values of K(n) converge to constants, and so:

Σ(n+1) = Σ(n)   (1.78)

This equation translates to an algebraic Riccati equation (ARE), which for the OPKF
[40] is:

A Σ Aᵀ - Σ - A Σ Cᵀ [C Σ Cᵀ + Σ_v]⁻¹ C Σ Aᵀ + Σ_w = 0   (1.79)
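The fixed point Σ(n+1) = Σ(n) can be found by simply iterating the covariance recursion of the one-step-prediction filter until it stops changing, which also solves the ARE. A Python sketch with invented system matrices (not the two-tank example):

```python
import numpy as np

# hypothetical stable, observable 2-state system
A = np.array([[0.9, 0.2], [0.0, 0.7]])
C = np.array([[1.0, 0.0]])
Sw = np.eye(2) * 0.05   # process noise covariance
Sv = np.eye(1) * 0.2    # observation noise covariance

# iterate the Riccati recursion Sigma(n+1) = f(Sigma(n)) to convergence
S = np.eye(2)
for _ in range(500):
    G = A @ S @ C.T @ np.linalg.inv(C @ S @ C.T + Sv)
    S = A @ S @ A.T - G @ C @ S @ A.T + Sw

# residual of the ARE: should be ~0 at the fixed point
G = A @ S @ C.T @ np.linalg.inv(C @ S @ C.T + Sv)
residual = A @ S @ A.T - S - G @ C @ S @ A.T + Sw
```

For production code a dedicated solver (for instance SciPy's `solve_discrete_are`) is preferable, but the brute-force iteration shows directly why the steady-state gain can be pre-computed.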
(Figure 1.9: evolution of the four Kalman gains K11, K12, K21, K22, along 40 sampling periods.)
with:

Φ = A - K C
K = A Σ Cᵀ [C Σ Cᵀ + R]⁻¹   (1.81)
In certain cases, it would be convenient to use this constant K from the beginning,
since it can be pre-computed.
Figure 1.9 shows the evolution of the Kalman filter gains in the two-tank system
example. Like before, the figure has been generated using the extended version of
Program 1.3. Clearly, the Kalman gains rapidly evolve to constant values.
Continuing with the example of the two-tank system, Fig. 1.10 shows the evolution
of the state covariance (the four matrix components). It has been computed with the
extended version of Program 1.3. It is clear that all components tend to constant
values.
As said before, a main idea of the Kalman filter is to combine prediction and measurement
to get good state estimation (the best in terms of minimum variance).

(Figure 1.10: evolution of the four components P11, P12, P21, P22 of the state covariance.)

An interesting feature of the two-tank example used in this section is that it is a
two-dimensional system, so the PDFs of the variables can be plotted as ellipsoids. This
yields a valuable opportunity to graphically illustrate the Kalman filter steps.
The next three figures have been generated with a program included in Appendix
B. This program focuses on the transition from state number 3 to state number 4.
Figure 1.11 corresponds to the prediction step. The transition equation is used
to predict the next state xa (n + 1) using the present estimated state xe (n). There is
process noise, with covariance Sw, that increases the uncertainty of the predicted
state.
Figure 1.12 corresponds to the output measurement equation. There is measure-
ment noise, with covariance Sv, that increases the uncertainty of the result. The actual
measurement, provided by the system simulation, has been marked with a star on
the right-hand plot. Recall that the update step considers the error between the actual
measurement and C xa (n + 1) (this term has been called estimated y in Figs. 1.12
and 1.13).
Finally, Fig. 1.13 contains, from left to right, the two things to combine in the
update step, prediction and measurement, and the final result. Notice that the result,
which is the estimated state, has smaller variance.
(Figure 1.11: the prediction step, from left to right; the process noise, with covariance Sw, enlarges the uncertainty ellipsoid of the predicted state xa(n+1).)

(Figure 1.12: the measurement; the measurement noise, with covariance Sv, increases the uncertainty, and the actual measurement is marked with a star.)

(Figure 1.13: from left to right, the prediction, the measurement, and the final updated estimate, which has smaller variance.)
It is found that the zeros of this transfer function are given by:

det(z I - U) = 0   (1.88)
Consequently, the zeros of the Kalman filter are placed on the poles of the observation
noise.
When it is desired that the Kalman filter rejects certain frequencies, it is opportune
to raise these frequencies in the observation noise model.
(Figure 1.14: Bayes net corresponding to the Kalman filter, with the observation density p(y_k | x_k) linking the state to the measurement sets Y_{k-1}, Y_k.)
Consider the Bayes net depicted in Fig. 1.14. It corresponds to the Gauss-Markov
model used in the Kalman filter.
Notice that the diagram can be regarded as a representation of a Hidden Markov
Model (HMM), where the observed variables have been represented as shaded
squares.
The filtering problem is to estimate, sequentially, the states of a system as mea-
surements are obtained. The Kalman filter gives a procedure to get p(xk |Yk ).
The only difference between both equations is evident: to use K or K(n). Clearly,
the Kalman filter can be considered an optimal state observer. Likewise, the differences
are even smaller if the stationary Kalman filter is chosen; in this case, the Kalman
filter can be regarded as an algorithm to compute a suboptimal K for the observer.
1.3.3.3 Innovations
Along the treatment of Kalman filter aspects, some authors [72] consider the following
difference:

ε(n+1) = y(n+1) - ŷ(n+1|n)   (1.91)
Therefore, the first equation of the AKF Kalman filter could be written as follows:
Notice that the innovations correspond to the error term [y(n+1) - C A x̂(n)].
It can be shown that the innovations process is a zero-mean Gaussian white noise.
The covariance matrix of the innovations process is:
Looking now at the OPKF filter, the error term is [y(n) - C x̂(n)]. It is the case that
authors concerned with this type of filter [40] define the innovations as follows:
In this form, the filter could be interpreted as a model that generates the sequence
y(n); it is denoted then as innovations model.
Another formulation could be:
This last expression could be interpreted as a white noise generator, departing from
the sequence y(n). In this case, it is a whitening filter.
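The whiteness of the innovations can be verified empirically. The following Python sketch uses a scalar system with invented parameters, computes a steady-state predictor gain by iterating the scalar Riccati recursion, collects the innovations sequence, and checks that its mean and lag-1 correlation are near zero:

```python
import numpy as np

rng = np.random.default_rng(1)
A, C, var_w, var_v = 0.9, 1.0, 0.2, 0.5   # invented scalar system

# steady-state one-step-prediction variance and gain via Riccati iteration
S = 1.0
for _ in range(200):
    S = A * A * S - (A * S * C) ** 2 / (C * C * S + var_v) + var_w
K = A * S * C / (C * C * S + var_v)

# simulate and collect innovations eps(n) = y(n) - C xhat(n|n-1)
N = 50_000
x, xhat = 0.0, 0.0
eps = np.empty(N)
for n in range(N):
    y = C * x + rng.normal(0.0, np.sqrt(var_v))
    eps[n] = y - C * xhat
    xhat = A * xhat + K * eps[n]          # one-step predictor update
    x = A * x + rng.normal(0.0, np.sqrt(var_w))

mean_eps = eps.mean()
rho1 = np.corrcoef(eps[:-1], eps[1:])[0, 1]
```

If a wrong (for instance, too small) gain were used instead, the innovations would become visibly correlated in time; this is the basis of several filter-consistency tests.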
The Kalman filter is connected with other topics, mainly because it is a recursive
procedure to minimize a quadratic criterion.
Given a vector of data d(n) and a set of parameters θ(n) to be estimated, the parameter
identification problem could be written as follows [40]:

θ(n+1) = θ(n)   (1.100)
y(n) = d(n)ᵀ θ(n) + v(n)
The equations above are a particular case of the Gauss-Markov model for A = I,
B = 0, C = d(n)ᵀ, w(n) = 0. In this case, the Kalman filter that gets the optimal
estimation of θ(n) is:
K(n+1) = P(n) d(n+1) / (σ²_v + d(n+1)ᵀ P(n) d(n+1))   (1.103)
Notice that the equations above are the same already obtained for least squares
recursive parameter identification (see the section on parameter identification in
Chap. 6). Of course, this is not surprising.
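The equivalence with recursive least squares can be demonstrated by running the filter for the static-parameter model on simulated data. A Python sketch, with an invented true parameter vector and noise level:

```python
import numpy as np

rng = np.random.default_rng(2)
theta_true = np.array([2.0, -1.0, 0.5])   # invented parameters to identify
var_v = 0.01                              # observation noise variance

# Kalman filter for the static-parameter model:
# theta(n+1) = theta(n);  y(n) = d(n)^T theta(n) + v(n)
theta = np.zeros(3)
P = np.eye(3) * 100.0                     # large initial uncertainty
for _ in range(300):
    d = rng.normal(size=3)                             # data (regressor) vector
    y = d @ theta_true + rng.normal(0.0, np.sqrt(var_v))
    K = P @ d / (var_v + d @ P @ d)                    # gain, as in Eq. (1.103)
    theta = theta + K * (y - d @ theta)                # corrected estimate
    P = P - np.outer(K, d) @ P                         # covariance update
```

These are exactly the recursive least squares update equations, with P playing the role of the (scaled) inverse information matrix.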
Let us use the OPKF innovations model for a single-input single-output system:

A = ( -a₁ 1 0 … 0 ; -a₂ 0 1 … 0 ; … ; -a_{m-1} 0 0 … 1 ; -a_m 0 0 … 0 );
B = ( b₁ ; b₂ ; … ; b_m );  K(n) = ( K₁(n) ; K₂(n) ; … ; K_m(n) )   (1.107)

C = ( 1 0 … 0 )   (1.108)
By means of successive substitutions and using these matrices, the innovations model
can be written as an ARMAX model:

A(q⁻¹) y(n) = B(q⁻¹) u(n) + C(n, q⁻¹) ε(n)   (1.109)

with:

A(q⁻¹) = 1 + a₁ q⁻¹ + a₂ q⁻² + … + a_m q⁻ᵐ   (1.110)
B(q⁻¹) = b₁ q⁻¹ + b₂ q⁻² + … + b_m q⁻ᵐ   (1.111)
C(n, q⁻¹) = 1 + (K₁(n-1) + a₁) q⁻¹ + (K₂(n-1) + a₂) q⁻² + … + (K_m(n-1) + a_m) q⁻ᵐ   (1.112)

In the case of a stationary Kalman filter, C(n, q⁻¹) would be C(q⁻¹).
The Kalman filter could be seen as a model that can be given in ARMAX format.
Likewise, the ARMAX model could be interpreted as a Kalman filter.
It was soon noticed, in the first aerospace applications of the Kalman filter, that the programs may suffer from numerical instability. Numerical errors may yield a non-symmetric or non-positive definite covariance matrix P(n).
One of the difficulties on the first spacecraft was the limited digital processing precision (15-bit arithmetic). Square root algorithms were introduced (Potter 1963) to get more precision with the same wordlength. The Cholesky factorization (upper and lower triangular factors) was used.
An alternative factorization was introduced (Bierman 1977) that does not require explicit square root computations. It is the U-D factorization.
MATLAB provides the chol() function for Cholesky factorization, and the ldl() function for U-D factorization.
There are some techniques to avoid the loss of P(n) symmetry. One is to symmetrize the matrix at each algorithm step by averaging with its transpose. Another is to use and propagate just the upper or the lower triangular part of the matrix. And another is to use Joseph's form:

P(n + 1) = [I − K(n + 1) C] M(n + 1) [I − K(n + 1) C]ᵀ + K(n + 1) Σv K(n + 1)ᵀ
Some difficulties could appear with the initialization of variables in the Kalman filter. For instance, having no knowledge of the state error covariance could lead to supposing large or infinite values as the starting point. The consequence could be an indeterminate value of K (∞/∞). This problem can be avoided by using the inverse of the state error covariance.
This gives an equivalent alternative to the AKF Kalman filter. The filtering algorithm would be:

L = Cᵀ Σv⁻¹     (1.116)

– Iterate:

xa(n) = A xe(n − 1) + B u(n − 1)     (1.117)

n ← n + 1     (1.122)
32 1 Kalman Filter, Particle Filter and Other Bayesian Filters
The inverse P⁻¹ of the covariance matrix is called the information matrix (also called the Fisher information matrix). This is the reason for the name information filter.
Recently, the research on mobile sensors and robots has shown a preference for the information filter, since it leads to sparse covariance representations [105, 114].
It is usual in real life applications to face nonlinearities. Then, instead of the linear
model that has been used up to now in this chapter, it would be more appropriate to
represent the system with the following equations:
The nonlinearities could appear in the transition equation and/or in the measurement
equation.
Examples of nonlinear phenomena are saturations, thresholds, or nonlinear laws
with trigonometric functions or exponentials or squares, etc. In many cases sensors,
friction, logarithmic variables, modulation, etc. are the cause of nonlinearity.
There are versions of the Kalman filter that are able to cope with certain nonlinear conditions. Several sections of this chapter cover this aspect. Now, it is opportune to focus on a preliminary consideration of nonlinear conditions under the optics of Kalman filtering.
Fig. 1.15 Satellite position data in Cartesian coordinates (m), with the mean of the cloud marked M

If you compute the average of the vertical position, and the average of the horizontal position, you get the point marked with a cross in the figure (denoted M). This point deviates from the correct expected value.
The change of coordinates, from polar to Cartesian, is nonlinear. The propagation
of statistical distributions through nonlinear transformations should be handled with
care.
xmean=sum(px/Np);
ymean=sum(py/Np);
%display
figure(1)
plot(px,py,'r.'); hold on; %the points
plot([0 0],[0 r0+20],'b'); %vertical line
plot(0,r0,'k+','MarkerSize',12); %+ for reference satellite
%
%X for mean cartesian position:
plot(xmean,ymean,'kx','MarkerSize',12);
%
title('Satellite position data');
Now, let us study in more detail a nonlinear measurement case. The original
data have a Gaussian distribution. The nonlinear function, due to the measurement
method, is arctan(). For instance, you measure an angle using x-y measurements.
Figure 1.16 depicts the nonlinear behaviour of the arctan() function (which is
atan() in MATLAB).
Figure 1.17 shows how a Gaussian PDF, with = 0.1, propagates through the
nonlinearity. It can be noticed that the original and the propagated PDFs are almost
equal. Figures 1.16 and 1.17 have been generated with the Program 1.5, which uses
a simple code vectorization for speed.
Fig. 1.16 The arctan() nonlinearity
1.4 Nonlinear Conditions 35
Fig. 1.17 Histogram of the Gaussian data (σ = 0.1) propagated through arctan()
Fig. 1.18 Propagation of a Gaussian PDF through the nonlinear measurement function: histogram before (bottom) and after (left)
Figure 1.18 shows a practical perspective of the PDF propagation through the nonlinear measurement function. The standard deviation of the Gaussian PDF has been set to 0.5. This kind of representation will be used again in the next sections.
It is clear that the arctan(x) function is almost linear for small values of x. A
curvature appears when x becomes larger, and then saturation becomes predominant
as x is further increased. A program has been made to investigate how the Gaussian
PDF propagates when the standard deviation is large enough to enter into nonlinear
effects. This is Program 1.7. The propagation has been applied to three Gaussian PDFs: the narrower with σ = 0.7, the intermediate with σ = 1, and the wider with σ = 2.
Figure 1.19 shows the results. When the standard deviation becomes larger, the
propagated PDF can exhibit two peaks.
Fig. 1.19 The propagated PDFs for the three values of σ
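The two-peak effect is easy to reproduce by brute-force sampling. The following Python sketch (an illustrative analogue of Program 1.7, not the book's MATLAB listing) propagates samples of a wide Gaussian through arctan() and compares the histogram peak with the central bin:

```python
import numpy as np

# Brute-force propagation of a wide Gaussian (sigma = 2) through arctan().
rng = np.random.default_rng(1)
x = rng.normal(0.0, 2.0, size=200_000)
y = np.arctan(x)

counts, edges = np.histogram(y, bins=60, range=(-1.5, 1.5))
centers = 0.5 * (edges[:-1] + edges[1:])

# Much of the probability mass is pushed towards the saturation levels
# of arctan(), so the central bin is no longer the most populated one:
# the propagated PDF is bimodal.
center_count = counts[np.argmin(np.abs(centers))]
peak_count = counts.max()
```

The histogram peaks sit well away from zero, even though the original Gaussian is centred at zero.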
It may happen that experimental data have a non-zero mean. In other words, the
corresponding PDF is horizontally shifted. If the data are propagated through the
arctan() function the result is a non-symmetric PDF.
Figure 1.20 shows a possible situation, when the original PDF, with σ = 2, is shifted 0.8 to the right.
Notice in Fig. 1.20 that the propagated PDF is asymmetrical. The mean of the
propagated PDF has been also depicted with a cross on the vertical axis. Following
the arrows, it is clear that the mean of the original Gaussian PDF propagates to a
different point. Figure 1.20 has been generated with an extension of the Program 1.6
that has been included in Appendix B.
Changing of coordinates, for example from polar coordinates to Cartesian, can be a cause of nonlinearity. This has been shown in the previous subsection. Actually, many applications involving tracking or positioning of vehicles or objects run into this situation.
In two dimensions, the tangent to a curve f(x) can be obtained using the first derivative df(x)/dx. A generalization of the first derivative is the gradient of a scalar function of several variables. A larger generalization is the Jacobian, which can be described as the first derivative of a vector function of several variables. Here is the expression of a Jacobian:
Jf = ∂f/∂x =
| ∂f1/∂x1  ∂f1/∂x2  ∂f1/∂x3  …  ∂f1/∂xN |
| ∂f2/∂x1  ∂f2/∂x2  …            ∂f2/∂xN |
|    ⋮         ⋮                     ⋮    |
| ∂fM/∂x1  ∂fM/∂x2  …            ∂fM/∂xN |     (1.125)
The use of Jacobians appears quite naturally when dealing with changes of coordi-
nates. For instance, if you change from variables x, y to variables u, v, the differen-
tials of surface are related as follows:
d x d y = |J | du dv (1.126)
The Jacobian tells us how a surface expands or shrinks when coordinates are trans-
formed. For instance, when changing from polar to Cartesian coordinates:
x = r cos θ
y = r sin θ     (1.128)
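For this change of coordinates the determinant of the Jacobian is r, so the surface element transforms as dx dy = r dr dθ. A quick numerical check (a Python illustration, not one of the book's programs):

```python
import numpy as np

def polar_to_cartesian_jacobian(r, theta):
    """Jacobian of (x, y) = (r cos(theta), r sin(theta)) w.r.t. (r, theta)."""
    return np.array([[np.cos(theta), -r * np.sin(theta)],
                     [np.sin(theta),  r * np.cos(theta)]])

# The determinant should equal r for any angle:
J = polar_to_cartesian_jacobian(2.0, 0.7)
detJ = np.linalg.det(J)
```

Since det J = r cos²θ + r sin²θ = r, a polar rectangle dr dθ maps to a Cartesian area that grows with the distance r from the origin.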
In the case of satellites or aerial vehicles it is pertinent to consider the relation between
Cartesian and spherical coordinates. The Jacobian in this case is:
J =
| ∂x/∂r  ∂x/∂θ  ∂x/∂φ |   | cos θ sin φ   −r sin θ sin φ   r cos θ cos φ |
| ∂y/∂r  ∂y/∂θ  ∂y/∂φ | = | sin θ sin φ    r cos θ sin φ   r sin θ cos φ |     (1.131)
| ∂z/∂r  ∂z/∂θ  ∂z/∂φ |   | cos φ          0               −r sin φ      |
This relation can be generalized using the inverse of the Jacobian and joint distributions. In particular, for a change of coordinates,

p_y = p_x |J_f|⁻¹     (1.133)
If the Jacobian is evaluated for θ = 0, the eigenvalues of the matrix are ± j √(g/L). If it is evaluated for θ = π, the eigenvalues are ± √(g/L). For real eigenvalues, it can be shown that if any eigenvalue is positive the state is unstable. For complex eigenvalues, if any eigenvalue has a positive real part the state is unstable. Therefore, when θ = π the pendulum is unstable.
A generalization of the second derivative is the Hessian. Given a scalar function
of several variables, the Hessian has the following expression:
Hf = ∂²f/∂x² =
| ∂²f/∂x1²      ∂²f/∂x1∂x2   ∂²f/∂x1∂x3  …  ∂²f/∂x1∂xN |
| ∂²f/∂x2∂x1   ∂²f/∂x2²      …               ∂²f/∂x2∂xN |
|     ⋮              ⋮                             ⋮      |
| ∂²f/∂xN∂x1   ∂²f/∂xN∂x2   …               ∂²f/∂xN²    |     (1.137)
When there are nonlinearities, it is very common to use linear approximations. A basic
case is represented in Fig. 1.21 where a tangent is used to approximate the nonlinear
curve f (x) near the point (x0 , f (x 0 )). Obviously, the quality of the approximation
depends on the shape of the curve.
In mathematical terms, there is a curve:
y = f (x) (1.139)
Fig. 1.21 Approximation of f(x) near x0 by the tangent line
The approximation could be improved using a Taylor series, with higher order deriv-
atives.
Also, this approach can be generalized for n dimensions. So it is possible to write:
f(x) ≈ f(x0) + (∂f(x)/∂x)|x0 Δx + (1/2) (∂²f(x)/∂x²)|x0 Δx² + (1/3!) (∂³f(x)/∂x³)|x0 Δx³ + …     (1.141)
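For instance, for arctan(x) around x0 = 0 the series is x − x³/3 + x⁵/5 − …, so adding the cubic term already improves on the tangent approximation. A small Python check (illustrative only):

```python
import numpy as np

# Compare the first-order (tangent) and third-order Taylor
# approximations of arctan around x0 = 0, at x = 0.4.
x = 0.4
first = x                      # tangent approximation
third = x - x ** 3 / 3.0       # adds the next Taylor term
exact = np.arctan(x)
err_first = abs(exact - first)
err_third = abs(exact - third)
```

The third-order approximation has a clearly smaller error, confirming that higher order terms refine the linear approximation.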
Suppose that x is a Gaussian variable with covariance P. It can be shown that the
mean and covariance of f(x) are:
μy ≈ f(x0) + (1/2) (d²f(x)/dx²)|x0 P + (1/4!) (d⁴f(x)/dx⁴)|x0 E(Δx⁴) + …     (1.142)

Py ≈ F P Fᵀ + (1/(2 · 4!)) (d²f(x)/dx²)|x0 E(Δx⁴) (d²f(x)/dx²)ᵀ|x0 + …     (1.143)
The approximation by Taylor series can be applied to the type of models being used
in this chapter. For instance, take a model of nonlinear state transition:
x(n + 1) = f(x(n), w(n))     (1.144)

y(n) = h(x(n), v(n))     (1.146)
The next sections of this chapter are devoted to filtering in nonlinear situations. A
common example will be used along these sections. It is the case of a falling body
being tracked by radar [61]. The body falls from high altitude, where atmospheric
drag is small. As altitude decreases the density of air increases and the drag changes.
Figure 1.22 pictures the example. The body falls vertically. The radar is placed a
distance L from the body vertical. The radar measures the distance y from the radar
to the body.
Body state variables are chosen such that x1 is altitude, x2 is velocity, and x3
is the ballistic coefficient. The falling body is subject to air drag, which could be
approximated with:
drag = d = ρ x2² / (2 x3)     (1.147)

ρ = ρ0 exp(−x1 / k)     (1.148)
Fig. 1.22 The falling body example: the radar, at horizontal distance L, measures the range y to the body at altitude x1
ẋ1 = x2
ẋ2 = d + g     (1.149)
ẋ3 = 0

The radar measurement is:

y = √(L² + x1²)     (1.150)

Notice the two nonlinearities of the example: the drag, and the square root.
A simple Euler discretization is chosen for the motion equations. Hence:
x1 (n + 1) = x1 (n) + x2 (n) T
x2 (n + 1) = x2 (n) + (d + g) T (1.151)
x3 (n + 1) = x3 (n)
Program 1.8 simulates the evolution of body variables. Figure 1.23 shows the evolu-
tion of altitude and velocity.
Fig. 1.23 Evolution of altitude and velocity along the fall
Fig. 1.24 Evolution of the radar distance measurement and of the air drag
Figure 1.24 shows the evolution of distance measurements obtained by the radar,
and the evolution of air drag.
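An analogue of Program 1.8 can be sketched in Python. The physical constants below are illustrative placeholders (the book's exact values appear in the MATLAB listing); the loop integrates the Euler-discretized equations (1.151):

```python
import numpy as np

# Euler-discretized falling-body model, Eq. (1.151).
# Constants are illustrative placeholders, not the book's exact values.
g = -9.81       # gravity, acting downwards
rho0 = 1.2      # air density at zero altitude
kk = 9000.0     # density decay constant
T = 0.01        # sampling period

def simulate_fall(x0, n_steps):
    x = np.array(x0, dtype=float)   # [altitude, velocity, ballistic coeff.]
    traj = [x.copy()]
    for _ in range(n_steps):
        rho = rho0 * np.exp(-x[0] / kk)       # air density, Eq. (1.148)
        d = rho * x[1] ** 2 / (2.0 * x[2])    # drag, Eq. (1.147)
        x = np.array([x[0] + x[1] * T,        # altitude
                      x[1] + (g + d) * T,     # velocity (drag opposes fall)
                      x[2]])                  # ballistic coeff. is constant
        traj.append(x.copy())
    return np.array(traj)

traj = simulate_fall([1.0e5, -5000.0, 400.0], 1500)
```

As in Fig. 1.23, the altitude decreases monotonically while the drag, negligible at high altitude, eventually slows the body down towards its terminal velocity.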
Fig. 1.25 Evolution of the Jacobian terms f21, f22 and f23 along the fall
The Jacobian of the transition function with respect to the state is:

∂f/∂x = | 0    1    0   |
        | f21  f22  f23 |
        | 0    0    0   |

where:

f21 = − ρ x2² / (2 k x3)     (1.154)

f22 = ρ x2 / x3     (1.155)

f23 = − ρ x2² / (2 x3²)     (1.156)

and:

∂h/∂x = [ x1 / √(L² + x1²)   0   0 ]     (1.157)

∂h/∂v = 1     (1.158)
v
Program 1.9 computes the value of the three non-zero components of the ∂f/∂x Jacobian along the body fall. The program includes the same simulations as Program 1.8. Figure 1.25 shows the evolution of the three non-zero Jacobian components. They all have a minimum (negative peak) corresponding to maximum drag.
To complete the initial analysis of the example, it is interesting to portray how the
nonlinearities influence the state transitions.
In this case, for a fast and simple computation, the option was to create a set of perturbed states forming circles around a selected central state. The state number 43 (near the drag peak) was selected as the central state. Then, the perturbed states were propagated to the next state, state number 44, according to the transition equation. The result was plotted in Fig. 1.26. All this was done by Program 1.10.
The propagated perturbed states form a series of closed curves showing certain
asymmetries. Notice the scaling of the vertical axis.
Fig. 1.26 The perturbed states before (left) and after (right) propagation (x1 versus 10·x2)
%display
m=1:Ne+1;
figure(1)
subplot(1,2,1)
plot(ax1(m),10*ax2(m),'k'); hold on;
title('before');xlabel('x1'); ylabel('10*x2');
axis([1.4e4 2.2e4 -3.2e4 -2.2e4]);
subplot(1,2,2);
plot(px1(m),10*px2(m),'k'); hold on;
title('after');xlabel('x1'); ylabel('10*x2');
axis([1.4e4 2.2e4 -3.2e4 -2.2e4]);
end;
In the next sections, additive process and observation noises are considered. That means simple changes to the equations of the example:

x1(n + 1) = x1(n) + x2(n) T + w1(n)
x2(n + 1) = x2(n) + (d + g) T + w2(n)
x3(n + 1) = x3(n) + w3(n)

y(n) = √(L² + x1(n)²) + v(n)
Fig. 1.27 Evolution of altitude and velocity under noisy conditions

Fig. 1.28 Evolution of the radar distance measurement and of the air drag under noisy conditions
Next Figs. 1.27 and 1.28 show the evolution of state variables, measurements, and air
drag during the fall under noisy conditions. These figures have been generated with
a program that has been included in Appendix B and that is similar to the programs
listed in this subsection.
The Extended Kalman Filter (EKF) is based on the linearization of the transition and the measurement equations [96, 119]. In this way, it can be used for nonlinear situations as long as the linear approximations are good enough.
The EKF uses a first order Taylor approximation of nonlinear functions via Jacobians, as described in Sect. 1.4. The process and the observation noises are modelled as Gaussians. The distribution of propagated states, obtained with the transition equation, is approximated as a Gaussian PDF. Likewise, the distribution of measurements, obtained with the measurement equation, is approximated as a Gaussian PDF.
The EKF is applied as an algorithm similar to the standard AKF Kalman filter,
repeating prediction and update steps.
1.5 Extended Kalman Filter (EKF) 53
(a) Prediction

xa(n + 1) = f(xe(n), u(n))     (1.163)

M(n + 1) = F(n) P(n) F(n)ᵀ + W(n) Σw W(n)ᵀ     (1.164)

(b) Update

K(n + 1) = M(n + 1) H(n)ᵀ [H(n) M(n + 1) H(n)ᵀ + V(n) Σv V(n)ᵀ]⁻¹     (1.165)

P(n + 1) = M(n + 1) − K(n + 1) H(n) M(n + 1)     (1.166)

xe(n + 1) = xa(n + 1) + K(n + 1) [y(n + 1) − h(xa(n + 1))]     (1.167)
The transition equation is directly used to predict the next state. According to Sect. 1.4, the associated covariance matrix M is computed using the following Jacobians, evaluated at xe(n), w(n):
F(n) = ∂f(x, w)/∂x

W(n) = ∂f(x, w)/∂w
The update step computes the covariance matrix P using the following Jacobians,
evaluated at xa (n + 1), v(n):
H(n) = ∂h(x, v)/∂x

V(n) = ∂h(x, v)/∂v
Notice that the measurement equation is directly used to compute the estimation
error (see the last equation in the update step).
Comparing the equations of the prediction and update steps, with these steps in
the standard Kalman filter, it could be said that the Jacobian F(n) plays the role of
the A matrix, and the Jacobian H (n) plays the role of the C matrix.
Fig. 1.29 System states (cross marks), and states estimated by the EKF (continuous)
bn=randn(3,Nf); sn=zeros(3,Nf);
sn(1,:)=sqrt(Sw(1,1))*bn(1,:); %state noise along simulation
sn(2,:)=sqrt(Sw(2,2))*bn(2,:); %" " "
sn(3,:)=sqrt(Sw(3,3))*bn(3,:); %" " "
%observation noise
Sv=10^6; %cov.
on=sqrt(Sv)*randn(1,Nf); %observation noise along simulation
%------------------------------------------
%Prepare for filtering
%space for matrices
K=zeros(3,Nf); M=zeros(3,3,Nf); P=zeros(3,3,Nf);
%space for recording er(n), xe(n)
rer=zeros(3,Nf); rxe=zeros(3,Nf);
W=eye(3,3); %process noise jacobian
V=1; %observation noise jacobian
%------------------------------------------
%Behaviour of the system and the filter after initial state
x=[10^5; -5000; 400]; %initial state
xe=x; % initial values of filter state
xa=xe; %initial intermediate state
nn=1;
while nn<Nf+1,
%estimation recording
rxe(:,nn)=xe; %state
rer(:,nn)=x-xe; %error
%system
rx(:,nn)=x; %state recording
rho=rho0*exp(-x(1)/k); %air density
d=(rho*(x(2)^2))/(2*x(3)); %drag
%next system state
x(1)=x(1)+(x(2)*T)+sn(1,nn);
x(2)=x(2)+((g+d)*T)+sn(2,nn);
x(3)=x(3)+sn(3,nn);
%system output
y=on(nn)+sqrt(L2+(x(1)^2));
ym=y; %measurement
%Prediction
%a priori state
rho=rho0*exp(-xe(1)/k); %air density
d=(rho*(xe(2)^2))/(2*xe(3)); %drag
xa(1)=xe(1)+(xe(2)*T);
xa(2)=xe(2)+((g+d)*T);
xa(3)=xe(3);
%a priori cov.
f21=-d/k; f22=(rho*xe(2)/xe(3)); f23=-(d/xe(3));
F=[0 1 0; f21 f22 f23; 0 0 0]; %state jacobian
M(:,:,nn+1)=(F*P(:,:,nn)*F')+ (W*Sw*W');
%
%Update
ya=sqrt(L2+xa(1)^2);
h1=xa(1)/ya;
H=[h1 0 0]; %measurement jacobian
K(:,nn+1)=(M(:,:,nn+1)*H')*inv((H*M(:,:,nn+1)*H')+(V*Sv*V'));
P(:,:,nn+1)=M(:,:,nn+1)-(K(:,nn+1)*H*M(:,:,nn+1));
xe=xa+(K(:,nn+1)*(ym-ya)); %estimated (a posteriori) state
nn=nn+1;
end;
%------------------------------------------
%display
figure(1)
subplot(1,2,1)
plot(tim,rx(1,1:Nf),'kx'); hold on;
plot(tim,rxe(1,1:Nf),'r');
title('altitude'); xlabel('seconds')
axis([0 Nf*T 0 12*10^4]);
subplot(1,2,2)
plot(tim,rx(2,1:Nf),'kx'); hold on;
plot(tim,rxe(2,1:Nf),'r');
title('velocity'); xlabel('seconds');
axis([0 Nf*T -6000 1000]);
An extended version of the Program 1.11 has been included in Appendix B. The next three figures have been obtained with that program.
Figure 1.30 shows the evolution of the state estimation error along an experiment.
Figure 1.31 shows the evolution of the covariance matrix P. Notice that the uncertainty increases near the middle of the plots (coincident with the drag peak).
Fig. 1.30 Evolution of the altitude and velocity estimation errors
Fig. 1.31 Evolution of the components P11, P12, P21, P22 of the covariance matrix P
Fig. 1.32 Evolution of the Kalman gains
Figure 1.32 shows the evolution of the Kalman gains along an experiment. The
Kalman gain decreases in the moments of increasing uncertainty.
Fig. 1.33 Propagation of a Gaussian PDF through the linearized measurement function: histogram before (bottom) and after (left)
In order to see in more detail what happens when the standard deviation of the orig-
inal PDF changes, another program, Program 1.13, has been developed. Figure 1.34
shows the linearized propagation, and the nonlinear propagation, of a Gaussian PDF,
for three values of the standard deviation: 0.2, 0.4, 0.6.
Fig. 1.34 Linearized (L) and nonlinear (N) propagation for σ = 0.2, 0.4 and 0.6
As the PDF becomes wider, the nonlinear propagation (N) diverges more from the linearized propagation (L). If you try to increase the standard deviation above the value 0.6, the tails of the original PDF get out of the range (−1.5, 1.5) imposed by arctan(). This could become a problem in a practical situation.
tdat=ttg*(adat)+tb;
%histogram of data through tangent
htg=hist(tdat,bx);
subplot(1,3,nn)
plot(bx,hpt,'b'); hold on;
plot(bx,htg,'k');
tit=['sigma= ',num2str(sig)];
title(tit);
axis([-1.5 1.5 0 4500]);
end;
Another problem with the saturation limits imposed by arctan() becomes apparent
if the original PDF is shifted. For instance, consider the case of the original Gaussian
PDF being shifted to the right by 0.8. Figure 1.35, which has been generated with
the Program 1.14, compares the nonlinear propagation (N ) and the linearized prop-
agation (L). The standard deviation has been set to 0.5, in order to avoid saturation
(although a little can be observed at the right tail of L). The mean of the data prop-
agated through arctan() has been depicted with a cross on the horizontal axis. The
peaks of the N and L curves do not coincide. Neither of the peaks coincides with
the propagated data mean. The PDF shift causes asymmetry.
Fig. 1.35 Nonlinear (N) and linearized (L) propagation of the shifted Gaussian PDF
Since computers are more and more powerful, one is tempted to use brute force methods, and this could be reasonable, especially when the alternatives, if any, do not offer good performance. For the approximation of PDFs, it is possible to propagate a lot of samples through the nonlinearities, and then use statistics of the propagated samples. An example of this approach is particle filters, which will be introduced in the next section.
However, it is possible to approximate the propagated PDFs based on the propa-
gation of a few, conveniently selected samples, like for instance the so-called sigma-
points [108].
One of the most cited methods for sigma-point Kalman filtering is the Unscented Kalman Filter (UKF). With the UKF it is not necessary to compute Jacobians nor Hessians, so the UKF belongs to the class of derivative-free Kalman filtering methods. Since the UKF uses only a few sigma points, it requires moderate computational effort. The propagated PDFs are approximated using the mean and variance obtained from the propagated sigma points [46, 53, 96, 115].
This section is devoted to introducing the UKF, and it is divided into two subsections. The first deals with the Unscented Transform (UT), which is the basis of the UKF. The satellite tracking example will be used to illustrate the main steps of the UT. The second subsection describes the UKF algorithm, and includes its application to the falling body example.
y = f(x) (1.168)
Suppose that the function is applied to a set of random data with a certain original PDF. Another set of data is obtained, with a propagated PDF. The idea of sigma points is to be able to obtain the mean and covariance matrix of the original data, using a set of selected points χi, i = 0, …, M, as follows:

μx = Σ_{i=0}^{M} wi χi     (1.169)

Pxx = Σ_{i=0}^{M} wi (χi − μx)(χi − μx)ᵀ     (1.170)

where the wi are suitable weights. They are normalized, so the sum of the weights is one.
In addition, the propagated sigma points,
Yi = f(i ) (1.171)
are used to obtain the mean and covariance matrix of the propagated data:
μy = Σ_{i=0}^{M} wi Yi     (1.172)

Pyy = Σ_{i=0}^{M} wi (Yi − μy)(Yi − μy)ᵀ     (1.173)

and also:

Pxy = Σ_{i=0}^{M} wi (χi − μx)(Yi − μy)ᵀ     (1.174)

The sigma points can be chosen as:

χ0 = μx
χi = χ0 + λ σi ,   i = 1, …, N     (1.175)
χi = χ0 − λ σ_{i−N} ,   i = N + 1, …, 2N
where λ is a scaling factor, N is the dimension of the vector space, and σi is the i-th column of the matrix square root Σ of Pxx (the covariance matrix):

Pxx = Σ Σᵀ     (1.176)
The matrix square root can be obtained using the Cholesky decomposition.
In order to determine the weights wi a set of constraint equations can be estab-
lished. For example, in the scalar variable case:
Σ_{i=0}^{M} wi − 1 = 0     (1.177)

Σ_{i=0}^{M} wi χi − μx = 0     (1.178)

Σ_{i=0}^{M} wi (χi − μx)(χi − μx)ᵀ − Pxx = 0     (1.179)
Suppose that the original PDF is a Gaussian N(0, 1), and that λ = 1. Then, the constraint equations are:
1.6 Unscented Kalman Filter (UKF) 65
Fig. 1.36 Three sigma points (cross marks) on a Gaussian PDF
w0 + w1 + w2 − 1 = 0
w1 χ1 + w2 χ2 − 0 = 0     (1.180)
w1 χ1² + w2 χ2² − 1 = 0

χ1 = −χ2     (1.181)

w1 = w2 = 1 / (2 χ1²)
w0 = 1 − 2 w1     (1.182)
A first example of sigma points is illustrated in Fig. 1.36, using a Gaussian PDF. There are three sigma points, represented with cross marks on the horizontal axis. The lateral points are placed at a distance σ (the standard deviation) from the central sigma point.
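These weights and points are easy to verify numerically. In this Python sketch (illustrative, with λ = 1 as in the text, not one of the book's MATLAB programs), the weighted sums recover the mean and variance of a scalar Gaussian exactly:

```python
import numpy as np

# Three sigma points for a scalar Gaussian N(mu, sigma^2), lambda = 1:
# chi0 = mu, chi1 = mu + sigma, chi2 = mu - sigma.
mu, sigma = 3.0, 2.0
chi = np.array([mu, mu + sigma, mu - sigma])

lam = 1.0
w1 = w2 = 1.0 / (2.0 * lam ** 2)   # from the variance constraint
w0 = 1.0 - 2.0 * w1                # weights sum to one (w0 = 0 here)
w = np.array([w0, w1, w2])

mean_ut = w @ chi                  # weighted mean, Eq. (1.169)
var_ut = w @ (chi - mu) ** 2       # weighted variance, Eq. (1.170)
```

The weighted statistics match the original mean and variance, which is exactly what the constraint equations (1.177)-(1.179) enforce.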
x1s1=mu+sig;
x1s2=mu-sig;
%the points on PDF
y1s0=max(y1); %the PDF peak
aux=(x1s1-mu).^2;
y1s1=(exp(-aux/(2*(sig^2)))/(sig*sqrt(2*pi)));
aux=(x1s2-mu).^2;
y1s2=(exp(-aux/(2*(sig^2)))/(sig*sqrt(2*pi)));
figure(1)
plot(x1,y1,'k'); hold on; %the PDF
plot(x1s0,0,'rx','MarkerSize',12); %central sigma point
plot(x1s1,0,'rx','MarkerSize',12); %right sigma point
plot(x1s2,0,'rx','MarkerSize',12); %left sigma point
plot([x1s0 x1s0],[0 y1s0],'b--'); %central sigma point line
plot([x1s1 x1s1],[0 y1s1],'b--'); %right sigma point line
plot([x1s2 x1s2],[0 y1s2],'b--'); %left sigma point line
title('sigma points');
axis([-10 10 0 0.15]);
Another example of sigma points is shown in Fig. 1.37 with a bivariate Gaussian
PDF and five sigma points.
Fig. 1.38 Propagation of the sigma points through the nonlinear measurement function: histogram before (bottom) and after (left)
The sigma points are propagated through the nonlinear function. An example
of this is displayed on Fig. 1.38. With the propagated sigma points it is possible to
compute the mean and variance of the propagated data. This information could be
applied to approximate the propagated PDF with a Gaussian PDF. The figure has
been generated with a program that has been included in Appendix B.
It has been noticed that the distance of the lateral sigma points to the central point
increases as N increases. This is not convenient, and a scaling scheme has been
devised to circumvent this problem:
χ0 = μx
χi = χ0 + (√((N + λ) Pxx))i ,   i = 1, …, N     (1.183)
χi = χ0 − (√((N + λ) Pxx))_{i−N} ,   i = N + 1, …, 2N

where (matrix)i means the i-th column of the matrix, and λ is a scaling parameter, such that:

λ = α² (N + κ) − N     (1.184)

The parameter α determines the spread of the sigma points around the centre, and usually takes a small positive value, equal to or less than one. The term (N + λ) is usually set equal to 3.
The weighting factors are chosen as follows:

w0m = λ / (N + λ) ;   w0c = w0m + (1 − α² + β)     (1.185)

wi = 1 / (2 (N + λ)) ,   i = 1, …, 2N
Fig. 1.39 Original data cloud in polar coordinates (rad, m), with five sigma points marked

Fig. 1.40 The propagated data cloud in Cartesian coordinates (m), with the propagated sigma points marked
A set of five sigma points has been selected. They are shown with cross marks.
The data are transformed to Cartesian coordinates (a polar to Cartesian coordinate
transformation). Figure 1.40 shows the PDF of the propagated data. In addition, the
propagated sigma points have been plotted with cross marks. Clearly, part of the
symmetry of the original sigma points is lost on the Cartesian plane.
Now, the weights are applied to obtain the mean and covariance of the propagated
data. With this information, a bivariate Gaussian PDF has been obtained (having the
computed mean and covariance) as approximation to the propagated PDF. Figure 1.41
shows the cloud of propagated data and a contour plot of the bivariate Gaussian PDF.
The mean of propagated data has been plotted with a cross mark, while the centre
of the PDF has been plotted with a plus sign. Notice that both points are almost
coincident.
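The whole sequence (scaled sigma point selection, propagation, weighted statistics) can be condensed into a short function. The following Python sketch (illustrative values and names; the book's version is the MATLAB program in Appendix B) applies it to a polar-to-Cartesian transformation:

```python
import numpy as np

def unscented_transform(mu, P, f, alpha=1.0, beta=2.0, kappa=1.0):
    """Scaled unscented transform of N(mu, P) through f, Eqs. (1.183)-(1.185)."""
    N = len(mu)
    lam = alpha ** 2 * (N + kappa) - N
    S = np.linalg.cholesky((N + lam) * P)   # matrix square root (columns used)
    chi = ([mu] + [mu + S[:, i] for i in range(N)]
               + [mu - S[:, i] for i in range(N)])
    wm = np.full(2 * N + 1, 1.0 / (2.0 * (N + lam)))
    wc = wm.copy()
    wm[0] = lam / (N + lam)
    wc[0] = wm[0] + (1.0 - alpha ** 2 + beta)
    Y = np.array([f(c) for c in chi])       # propagated sigma points
    mu_y = wm @ Y                           # weighted mean
    Py = sum(w * np.outer(y - mu_y, y - mu_y) for w, y in zip(wc, Y))
    return mu_y, Py

# Polar (r, theta) -> Cartesian (x, y), as in the satellite example.
f = lambda p: np.array([p[0] * np.cos(p[1]), p[0] * np.sin(p[1])])
mu = np.array([500.0, np.pi / 2])           # mean range and bearing
P = np.diag([10.0 ** 2, 0.25 ** 2])         # range and bearing variances
mu_y, Py = unscented_transform(mu, P, f)
```

As in Fig. 1.41, the estimated mean lands below the naive point (0, 500): the bearing spread pulls the mean of y inwards, which the UT captures without any Jacobian.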
Fig. 1.41 Cloud of propagated data and contour plot of the approximating bivariate Gaussian PDF (m)
Figures 1.39, 1.40 and 1.41 have been generated with a long program that has
been included in Appendix B.
There are some variations of the Kalman filter that use sigma points. They are denoted as sigma-point Kalman filters. The UKF is an important example of this type of filter.
The UKF applies the unscented transform in a simple way to obtain the covariance matrices used to compute the Kalman gains. Like the standard Kalman filter, the UKF proceeds with a repeated two-step algorithm. The details of the UKF algorithm are as follows.
Given the (nonlinear) system:

x(n + 1) = f(x(n), u(n), w(n))     (1.186)

y(n) = h(x(n), v(n))     (1.187)

Define a set of sigma points xs(i) according to the scaled unscented transform previously described.
Now, repeat the following steps.

(a) Prediction

Propagate the sigma points through the transition equation, obtaining xas(i)(n + 1). Then:

xa(n + 1) = w0m xas(0)(n + 1) + Σ_{i=1}^{2N} wi xas(i)(n + 1)     (1.189)

M(n + 1) = Σw +
+ w0c (xas(0)(n + 1) − xa(n + 1)) (xas(0)(n + 1) − xa(n + 1))ᵀ +
+ Σ_{i=1}^{2N} wi (xas(i)(n + 1) − xa(n + 1)) (xas(i)(n + 1) − xa(n + 1))ᵀ     (1.190)

(b) Update

Propagate the sigma points through the measurement equation, obtaining yas(i)(n + 1). Then:

ya(n + 1) = w0m yas(0)(n + 1) + Σ_{i=1}^{2N} wi yas(i)(n + 1)     (1.192)

Syy(n + 1) = Σv +
+ w0c (yas(0)(n + 1) − ya(n + 1)) (yas(0)(n + 1) − ya(n + 1))ᵀ +
+ Σ_{i=1}^{2N} wi (yas(i)(n + 1) − ya(n + 1)) (yas(i)(n + 1) − ya(n + 1))ᵀ     (1.193)

Sxy(n + 1) =
= w0c (xas(0)(n + 1) − xa(n + 1)) (yas(0)(n + 1) − ya(n + 1))ᵀ +
+ Σ_{i=1}^{2N} wi (xas(i)(n + 1) − xa(n + 1)) (yas(i)(n + 1) − ya(n + 1))ᵀ     (1.194)

K(n + 1) = Sxy(n + 1) Syy(n + 1)⁻¹     (1.195)
Fig. 1.42 System states (cross marks), and states estimated by the UKF (continuous)
P(n + 1) = M(n + 1) − K(n + 1) Syy(n + 1) K(n + 1)ᵀ     (1.196)

xe(n + 1) = xa(n + 1) + K(n + 1) [y(n + 1) − ya(n + 1)]     (1.197)
As it was done in the EKF section, the example of the falling body is used to illus-
trate the UKF algorithm. Figure 1.42 shows the main result, which is the estimation
of altitude and velocity of the body along time. The figure has been generated with
the Program 1.17.
Notice that the MATLAB function chol() has been chosen to compute the matrix square root for the sigma points. This function uses the Cholesky factorization. Some care is needed in the specification of the sigma point parameters, to keep the covariance matrix positive definite along the experiments (otherwise MATLAB refuses to apply the factorization). Note also that in the program the central sigma point has been indexed with the number 7, instead of 0.
%system output
y=on(nn)+sqrt(L2+(x(1)^2));
ym=y; %measurement
ry(nn)=ym; %measurement recording
%Prediction
%sigma points
sqP=chol(lN*P(:,:,nn)); %matrix square root
xs(:,7)=xe;
xs(:,1)=xe+sqP(1,:)'; xs(:,2)=xe+sqP(2,:)';
xs(:,3)=xe+sqP(3,:)';
xs(:,4)=xe-sqP(1,:)'; xs(:,5)=xe-sqP(2,:)';
xs(:,6)=xe-sqP(3,:)';
%a priori state
%propagation of sigma points (state transition)
for m=1:7,
rho=rho0*exp(-xs(1,m)/k); %air density
d=(rho*(xs(2,m)^2))/(2*xs(3,m)); %drag
xas(1,m)=xs(1,m)+(xs(2,m)*T);
xas(2,m)=xs(2,m)+((g+d)*T);
xas(3,m)=xs(3,m);
end;
%a priori state mean (a weighted sum)
xa=0;
for m=1:6,
xa=xa+(xas(:,m));
end;
xa=xa/(2*lN);
xa=xa+(LaN*xas(:,7));
%a priori cov.
aux=zeros(3,3); aux1=zeros(3,3);
for m=1:6,
aux=aux+((xas(:,m)-xa(:))*(xas(:,m)-xa(:))');
end;
aux=aux/(2*lN);
aux1=((xas(:,7)-xa(:))*(xas(:,7)-xa(:))');
aux=aux+(aaN*aux1);
M(:,:,nn+1)=aux+Sw;
%Update
%propagation of sigma points (measurement)
for m=1:7,
yas(m)=sqrt(L2+(xas(1,m)^2));
end;
%measurement mean
ya=0;
for m=1:6,
ya=ya+yas(m);
end;
ya=ya/(2*lN);
ya=ya+(LaN*yas(7));
%measurement cov.
aux2=0;
for m=1:6,
aux2=aux2+((yas(m)-ya)^2);
end;
aux2=aux2/(2*lN);
aux2=aux2+(aaN*((yas(7)-ya)^2));
Syy=aux2+Sv;
%cross cov
aux2=0;
for m=1:6,
aux2=aux2+((xas(:,m)-xa(:))*(yas(m)-ya));
end;
aux2=aux2/(2*lN);
aux2=aux2+(aaN*((xas(:,7)-xa(:))*(yas(7)-ya)));
Sxy=aux2;
%Kalman gain, etc.
K(:,nn+1)=Sxy*inv(Syy);
P(:,:,nn+1)=M(:,:,nn+1)-(K(:,nn+1)*Syy*K(:,nn+1)');
xe=xa+(K(:,nn+1)*(ym-ya)); %estimated (a posteriori) state
nn=nn+1;
end;
%------------------------------------------
%display
figure(1)
subplot(1,2,1)
plot(tim,rx(1,1:Nf),'kx'); hold on;
plot(tim,rxe(1,1:Nf),'r');
title('altitude'); xlabel('seconds')
axis([0 Nf*T 0 12*10^4]);
subplot(1,2,2)
plot(tim,rx(2,1:Nf),'kx'); hold on;
plot(tim,rxe(2,1:Nf),'r');
title('velocity'); xlabel('seconds');
axis([0 Nf*T -6000 1000]);
An extended version of the Program 1.17 has been included in Appendix B. The next three figures have been obtained with that program.
Figure 1.43 shows the evolution of the state estimation error along an experiment.
Figure 1.44 shows the evolution of the covariance matrix P. Notice that the uncertainty increases near the middle of the plots (coincident with the drag peak).
Figure 1.45 shows the evolution of the Kalman gains along an experiment. The evolution of the Kalman gains reflects the changes in the matrix P.
Fig. 1.43 Evolution of the estimation error: altitude (left) and velocity (right), versus seconds
Fig. 1.44 Evolution of the entries of the covariance matrix P (P11, P12, P21, P22), versus sampling periods
1.7 Particle Filter

Particle filters are applications of Monte Carlo methods to Bayesian estimation. Sets
of random states are used to obtain expected values, variances, etc., corresponding
to dynamic processes. These random states are called particles, a term that evokes a
statistical mechanics metaphor.
Fig. 1.45 Evolution of the Kalman gains: altitude (left) and velocity (right), versus seconds
Like the other filters already described, the particle filter algorithm repeats prediction
and update steps. The algorithm neatly belongs to the Bayesian filtering approach,
as it was briefly described in the chapter introduction.
The type of particle filter selected for this subsection is the sequential importance
resampling (SIR) filter. It was introduced by several authors under different names,
such as the bootstrap filter, the condensation algorithm, etc.
Suppose there are N random samples from the posterior PDF at time step $k-1$,
$p(x_{k-1} \mid Y_{k-1})$. These samples are called particles, and will be denoted as $x_{k-1}^j$, with
$j = 1, 2, \ldots, N$.
The prediction operation is just the propagation of the particles through the system
transition equation:
$$x_k^j = f(x_{k-1}^j, w_{k-1}^j) \qquad (1.198)$$
In this way, a set of particles from the prior PDF $p(x_k \mid Y_{k-1})$ has been obtained.
For the update operation a weight $W_k^j$ is computed for each particle, based on the
measurements at time step k. The weight $W_k^j$ is the measurement likelihood evaluated
at the value of $x_k^j$:

$$W_k^j = p(y_k \mid x_k^j) \qquad (1.199)$$
The weights are then normalized to sum unity. Once the weights are obtained, there
is a resampling step. The objective is to produce a set of samples from the posterior
PDF $p(x_k \mid Y_k)$. These samples are extracted from the $x_k^j$. The resampling procedure
is as follows: a particle from the set $x_k^j$ is chosen with a probability equal to its weight;
the procedure is repeated N times to get a new set; the same particle could be
chosen several times. The idea is to get more samples from the more plausible states,
according to their measurement likelihoods.
Let us write these steps in detail, using an easy-to-program notation.
The program begins with the generation of a set of particles px(0). For simpler
notation, the index for each particle has been dropped.
(a) Prediction
Use the transition equation to propagate the particles:
$$apx(n+1) = f(px(n),\, w(n)) \qquad (1.200)$$
(b) Update
Use the measurement equation to obtain measurements of the propagated particles
and their standard deviations:
$$y(n+1) = h(apx(n+1),\, v(n+1)) \qquad (1.201)$$

$$\tilde{y}(n+1) = ym - y(n+1) \qquad (1.202)$$
(in the case of our program, ym is obtained via simulation of the system; in real-time
applications this value could be just the average of measurements).
Evaluate the measurement PDF at the propagated particles. Then, normalize to
obtain the weights. For instance:
Fig. 1.46 System states (cross marks), and states estimated by the particle filter (continuous)
The Program 1.18 provides an implementation of the particle filter for the falling
body example. Figure 1.46 shows the results, which are quite good.
%process noise
Sw=[10^5 0 0; 0 10^3 0; 0 0 10^2]; %cov
w11=sqrt(Sw(1,1)); w22=sqrt(Sw(2,2)); w33=sqrt(Sw(3,3));
w=[w11; w22; w33];
%observation noise
Sv=10^6; %cov.
v11=sqrt(Sv);
%------------------------------------------
%Prepare for filtering
%space for recording er(n), xe(n)
rer=zeros(3,Nf); rxe=zeros(3,Nf);
%------------------------------------------
%Behaviour of the system and the filter after initial state
x=[10^5; -5000; 400]; %initial state
xe=x; %initial estimation
%prepare particles
Np=1000; %number of particles
%reserve space
px=zeros(3,Np); %particles
apx=zeros(3,Np); %a priori particles
ny=zeros(1,Np); %particle measurements
vy=zeros(1,Np); %meas. dif.
pq=zeros(1,Np); %particle likelihoods
%particle generation
wnp=randn(3,Np); %noise (initial particles)
for ip=1:Np,
px(:,ip)=x+(w.*wnp(:,ip)); %initial particles
end;
%system noises
wx=randn(3,Nf); %process
wy=randn(1,Nf); %output
nn=1;
while nn<Nf+1,
%estimation recording
rxe(:,nn)=xe; %state
rer(:,nn)=x-xe; %error
%Simulation of the system
%system
rx(:,nn)=x; %state recording
rho=rho0*exp(-x(1)/k); %air density
d=(rho*(x(2)^2))/(2*x(3)); %drag
rd(nn)=d; %drag recording
%next system state
x(1)=x(1)+(x(2)*T);
x(2)=x(2)+((g+d)*T);
x(3)=x(3);
x=x+(w.*wx(:,nn)); %additive noise
%system output
y=sqrt(L2+(x(1)^2))+(v11*wy(nn)); %additive noise
ym=y; %measurement
%Particle propagation
wp=randn(3,Np); %noise (process)
vm=randn(1,Np); %noise (measurement)
for ip=1:Np,
rho=rho0*exp(-px(1,ip)/k); %air density
d=(rho*(px(2,ip)^2))/(2*px(3,ip)); %drag
%next state
apx(1,ip)=px(1,ip)+(px(2,ip)*T);
apx(2,ip)=px(2,ip)+((g+d)*T);
apx(3,ip)=px(3,ip);
apx(:,ip)=apx(:,ip)+(w.*wp(:,ip)); %additive noise
%measurement (for next state)
ny(ip)=sqrt(L2+(apx(1,ip)^2))+(v11*vm(ip)); %additive noise
vy(ip)=ym-ny(ip);
end;
%Likelihood
%(vectorized part)
%scaling
vs=max(abs(vy))/4;
ip=1:Np;
pq(ip)=exp(-((vy(ip)/vs).^2));
spq=sum(pq);
%normalization
pq(ip)=pq(ip)/spq;
%Prepare for roughening
A=(max(apx')-min(apx'))';
sig=0.2*A*Np^(-1/3);
rn=randn(3,Np); %random numbers
%================================
%Resampling (systematic)
acq=cumsum(pq);
cmb=linspace(0,1-(1/Np),Np)+(rand(1)/Np); %the "comb"
cmb(Np+1)=1;
ip=1; mm=1;
while(ip<=Np),
if (cmb(ip)<acq(mm)),
aux=apx(:,mm);
px(:,ip)=aux+(sig.*rn(:,ip)); %roughening
ip=ip+1;
else
mm=mm+1;
end;
end;
%=================================
%Results
%estimated state (the particle mean)
xe=sum(px,2)/Np;
nn=nn+1;
end;
%------------------------------------------
%display
figure(1)
subplot(1,2,1)
plot(tim,rx(1,1:Nf),'kx'); hold on;
plot(tim,rxe(1,1:Nf),'r');
title('altitude'); xlabel('seconds')
axis([0 Nf*T 0 12*10^4]);
subplot(1,2,2)
plot(tim,rx(2,1:Nf),'kx'); hold on;
plot(tim,rxe(2,1:Nf),'r');
title('velocity'); xlabel('seconds');
axis([0 Nf*T -6000 1000]);
Figure 1.47 shows the estimation errors recorded during an experiment. The figure
has been generated with an extended version of the Program 1.18 that has been
included in Appendix B.
Fig. 1.47 Evolution of the estimation error: altitude (left) and velocity (right), versus seconds
To provide a tool for the reader, which may be used to check the resampling
schemes presented below, a program has been developed. This program is included
in Appendix B. All the figures in this subsection have been generated with this
program.
The four resampling schemes to be described now have been implemented and
included in the same program. The reader could easily examine the results of any of
these schemes. The idea of the program is to simulate some first steps (up to three)
of the falling body example, and then apply all four resampling schemes to obtain
separately the next generation of particles. The results of the different schemes can
be compared.
In all simulation experiments, it can be observed that before resampling, many
particles have low weights. Figure 1.48 shows, for example, a typical histogram of
weights, obtained in an experiment. The left column clearly expresses the abundance of
particles with low weights.
The effect of getting a dispersion of particles (low importance weights) has been
denoted as weight degeneracy. Resampling mechanisms are intended to counteract
this effect.
To further increase the diversity of particles after resampling, some noise can be
added: this is denoted as roughening. The next program fragments give examples of it.
In general, the nucleus of resampling schemes is inversion sampling, where the
cumulative distribution $F(\cdot)$ is built using the importance weights. The MATLAB
cumsum() function is employed for this purpose. Figure 1.49 shows the result of this
function in a typical experiment.
Figure 1.50 zooms in on part of the cumsum() plot. When resampling, the number
of copies of the particle apx(mm) (this is how a priori particles are denoted in the
MATLAB program) should be proportional to the corresponding normalized weight.
Fig. 1.48 Histogram of the particle weights

Fig. 1.49 Cumulative sum of the normalized weights $W_k^i$, versus the particle index mm
Let us proceed with the description of four resampling methods. Pieces of MATLAB
code, extracted from the program in Appendix B, are included. These pieces
correspond to each of the resampling methods, and serve to give the details of each
method's implementation.
In multinomial resampling, N independent uniform random numbers $u_j$ are drawn, and the new particles are:

$$x(F^{-1}(u_j)) \qquad (1.206)$$
Below is a MATLAB program segment that does multinomial resampling. The first
lines compute the cumulative sum acq of the normalized weights pq, and generate an
ordered set of uniform random numbers using the MATLAB sort() function. Then,
there is a loop that obtains the roughened copies, Mpx(ip), of selected prior particles.
The program uses two pointers. The pointer mm selects prior particles (the source
of copies). The pointer ip selects posterior particles (the roughened copies). The
internal while..end determines the values of mm, trying to select good candidates for
replication (prior particles with large weights).
Figure 1.51 shows histograms of the prior particles and the posterior particles
obtained by multinomial resampling. The figure corresponds to the third step of the
falling body simulation.
Fig. 1.51 Histogram of the prior particles (top) and of the (multinomial) resampled particles (bottom)
In systematic resampling, a single uniform draw u generates the ordered "comb" of numbers:

$$u_j = \frac{(j-1) + u}{N_p}\,, \quad u \sim U[0,1) \qquad (1.207)$$
Let us describe in more detail the MATLAB fragment. Like before, two pointers,
mm and ip, are used. There is a main loop that obtains one by one the posterior
particles Spx(:,ip). Suppose that, for instance, ip = 1; the first member of the comb
cmb is compared with the first member of acq; the index mm is increased until
acq(mm) becomes larger than cmb(1). Suppose that this happens when mm = 3,
the posterior particle Spx(:,1) is formed using the particle apx(:,3) and some added
roughening noise. Now, ip = 2. If cmb(2) < acq(3), a second particle Spx(:,2) is
obtained using apx(:,3). And so on, until there is an n such that acq(3) is not larger
than cmb(n). At that point, no more roughened copies of apx(:,3) are made; mm is increased
until acq(mm) > cmb(n). Then, the same procedure is followed to make roughened
copies of apx(:,mm). And so on. The process is repeated until Np posterior particles
are obtained.
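The comb-based selection can be sketched in Python/NumPy as follows (the naming is ours, not the book's; the function returns the selected particle indices, and roughening is omitted):

```python
import numpy as np

def systematic_resample(weights, rng):
    """Systematic resampling: one uniform draw positions a 'comb'
    of N evenly spaced points over the cumulative weights; each
    tooth selects the particle whose CDF segment it falls into."""
    N = len(weights)
    F = np.cumsum(weights)
    F[-1] = 1.0                                # guard against rounding
    comb = (np.arange(N) + rng.random()) / N   # u_j = (j-1+u)/N, shared u
    return np.searchsorted(F, comb, side='right')
```

Since the comb points are evenly spaced, the number of copies of particle i is within one of $N w_i$, which is the proportionality property discussed above.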
The algorithm is similar to systematic resampling, except that the ordered random
numbers are obtained with an independent uniform draw for each j:

$$u_j = \frac{(j-1) + \tilde{u}_j}{N_p}\,, \quad \tilde{u}_j \sim U[0,1) \qquad (1.208)$$
Below is a MATLAB fragment with the stratified resampling. The posterior particles
have been denoted as Fpx(ip).
The residual resampling could be described as a two pass process. In the first, deter-
ministic pass, na(j) copies of the particle apx(j) are obtained, where na(j) is computed
as floor(Np*pq(j)) (recall that pq() are the normalized weights). Suppose that NR
copies have been obtained. The second pass gets the remaining number of particles,
Np-NR, considering that the probability of selecting apx(i) is proportional to rpq(i)
(a modified weight: see the program). The second pass could use any of the previous
resampling schemes.
%preparation
na=floor(Np*pq); %repetition counts
NR=sum(na); %total count
Npr=Np-NR; %number of non-repeated particles
rpq=((Np*pq)-na)/Npr; %modified weights
acq=cumsum(rpq); %for the multinomial part
%deterministic part
mm=1;
for j=1:Np,
for nn=1:na(j),
Rpx(:,mm)=apx(:,j);
mm=mm+1;
end;
end;
%multinomial part:
nr=sort(rand(1,Npr)); %ordered random numbers (0, 1]
for j=1:Npr,
while(acq(mm)<nr(j)),
mm=mm+1;
end;
aux=apx(:,mm);
Rpx(:,NR+j)=aux+(sig.*rn(:,j)); %roughening
end;
1.7.7 Comparison
Some comparison studies conclude that stratified and systematic resampling are
better than multinomial resampling, in terms of both filter estimates and computational
complexity. In the case of residual resampling, approximately half of the copies
would be deterministic, and the other half stochastic: this means fewer computations,
although the saving could be offset by the recalculation of weights and other aspects of the
algorithm.
Figure 1.52 shows histograms of resampled particles corresponding to the third step
of the falling body simulation. Each histogram has been obtained with one of the
four resampling methods. Notice that residual resampling results in less variance.
1.7.8 Roughening

Roughening adds a small jitter to the resampled particles; the jitter standard deviation in each dimension is:

$$\sigma = k\, M\, N_p^{-1/n} \qquad (1.209)$$
Fig. 1.52 Histograms of resampled particles: systematic, stratified, multinomial, and residual resampling
where n is the dimension of the state space, k is a tuning parameter, and the j-th element
of the vector M is the maximum inter-particle spread along dimension j:

$$M(j) = \max_{m,n}\left( x_k^m(j) - x_k^n(j) \right) \qquad (1.210)$$
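A minimal Python/NumPy sketch of roughening, mirroring the line sig=0.2*A*Np^(-1/3) of the MATLAB programs (the function name is ours):

```python
import numpy as np

def roughen(particles, k=0.2, rng=None):
    """Add jitter after resampling: the per-dimension standard
    deviation is k * (max spread in that dimension) * Np^(-1/n),
    as in sig = 0.2*A*Np^(-1/3) in the MATLAB programs."""
    if rng is None:
        rng = np.random.default_rng(0)
    n, Np = particles.shape
    M = particles.max(axis=1) - particles.min(axis=1)  # spread per dimension
    sig = k * M * Np ** (-1.0 / n)
    return particles + sig[:, None] * rng.standard_normal((n, Np))
```

Note that if all particles coincide in some dimension, the spread M is zero there and no jitter is added in that dimension.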
A majority of particle filters use Sequential Monte Carlo (SMC) sampling; only
in some cases batch mode is chosen. The type of sampling is usually importance
sampling, employing a proposal importance function [7].
Suppose one wants to obtain an expectation of the form:
$$E(f) = \int f(x_k)\, p(x_k \mid Y_k)\, dx_k \qquad (1.211)$$
$$E(f) \approx \frac{(1/N)\sum_{i=1}^{N} w_k(x_k^i)\, f(x_k^i)}{(1/N)\sum_{i=1}^{N} w_k(x_k^i)} = \sum_{i=1}^{N} W_k(x_k^i)\, f(x_k^i) \qquad (1.216)$$

where

$$W_k(x_k^i) = \frac{w_k(x_k^i)}{\sum_{i=1}^{N} w_k(x_k^i)} \qquad (1.217)$$
Therefore:

$$w_k(x_k) = w_{k-1}(x_{k-1})\, \frac{p(y_k \mid x_k)\, p(x_k \mid x_{k-1})}{q(x_k \mid x_{k-1}, Y_k)} \qquad (1.222)$$
With this last expression, all the elements for the sequential importance sampling
(SIS) algorithm have been obtained. Let us summarize the algorithm:
SIS Algorithm:
On the basis of the normalized weights, one could approximate the posterior
distribution as follows:

$$p(x_k \mid Y_k) \approx \sum_{i=1}^{N} W_k(x_k^i)\, \delta(x_k - x_k^i) \qquad (1.225)$$

and estimate the posterior mean and covariance as:

$$\bar{x}_k \approx \sum_{i=1}^{N} W_k(x_k^i)\, x_k^i \qquad (1.226)$$

$$\mathrm{cov}(x_k) \approx \frac{1}{N} \sum_{i=1}^{N} (x_k^i - \bar{x}_k)(x_k^i - \bar{x}_k)^T \qquad (1.227)$$
Notice that when the posterior distribution is multimodal and/or skewed the mean
and the covariance may not be sufficient for statistical characterization.
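These weighted estimates can be sketched in Python/NumPy as follows (our naming; here the covariance is also computed with the normalized weights, which reduces to the 1/N form when the weights are equal after resampling):

```python
import numpy as np

def weighted_moments(particles, weights):
    """Posterior mean and covariance from weighted particles:
    mean = sum_i W_i x_i, cov = sum_i W_i (x_i - mean)(x_i - mean)^T,
    with W the normalized weights."""
    W = weights / weights.sum()
    mean = particles @ W                    # (n,) weighted mean
    centered = particles - mean[:, None]    # deviations from the mean
    cov = (centered * W) @ centered.T       # weighted outer products
    return mean, cov
```

For a multimodal or strongly skewed posterior, these two moments are a poor summary, as noted above.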
A serious problem with the SIS algorithm is the degeneracy phenomenon: after
a few iterations all but one particle have insignificant weight. A theorem establishes
that the unconditional variance of the importance ratio increases over time (as the
algorithm is repeated). The importance ratio is:
$$\frac{p(x_k^i \mid Y_k)}{q(x_k^i \mid Y_k)} \qquad (1.228)$$
SIR Algorithm:
$$W_k(x_k^i) = \frac{w_k(x_k^i)}{\sum_{i=1}^{N} w_k(x_k^i)} \qquad (1.231)$$
4. Apply resampling to obtain a set of equally weighted particles xik , which can be
used for estimation of the posterior PDF, mean, etc.
It has been shown that the optimal importance function that minimizes the variance
of the weights, conditioned upon xik1 and yk , is the following:
Fig. 1.53 Example of a prior PDF, versus x
This variance minimization implies that $N_{eff}$ (the effective number of particles) is maximized. Notice that the
importance function combines the prior and the likelihood.
With this choice, the weight update is:
Most practical applications use the prior as proposal importance function, and so it
is named the standard proposal:
This proposal is the core of the bootstrap particle filter. Program 1.18 provides
an implementation example of it.
In certain cases, the predominant factor in the optimal proposal could be the likeli-
hood, so the (scaled) likelihood could be chosen as proposal:
It is opportune to remark that in many cases it is not easy to find a good proposal.
An avenue to get a solution is provided by factorization of the proposal, so it can
be sequentially constructed as the filter algorithm runs. This is a main feature of
sequential importance sampling.
Bootstrap techniques are commonly used in statistics, and are based on resampling
from observed data. The basic idea is to get estimations using cumulative distributions
of samples.
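As an aside, the basic bootstrap idea can be sketched in a few lines of Python/NumPy (the naming is ours; the percentile method shown is one of several bootstrap variants):

```python
import numpy as np

def bootstrap_ci(data, stat, B=2000, alpha=0.05, rng=None):
    """Percentile bootstrap: resample the observed data with
    replacement B times and read a (1 - alpha) interval for the
    statistic from the empirical distribution of the replicates."""
    if rng is None:
        rng = np.random.default_rng(0)
    n = len(data)
    reps = np.array([stat(data[rng.integers(0, n, n)]) for _ in range(B)])
    return np.quantile(reps, [alpha / 2, 1 - alpha / 2])
```

Here the cumulative distribution of the resampled statistic plays the same role as the cumulative weight sums in the particle filter resampling schemes.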
Before introducing other proposal function alternatives, it is convenient to con-
sider the case of Gaussian p(xk | xik1 , yk ). This may happen, for instance, when the
transition equation is nonlinear and the measurement equation is linear:
$$x_k = f(x_{k-1}) + w_k \qquad (1.238)$$

$$y_k = C\, x_k + v_k \qquad (1.239)$$

with $w_k \sim N(0, \Sigma_w)$ and $v_k \sim N(0, \Sigma_v)$. Defining $\Sigma^{-1} = \Sigma_w^{-1} + C^T \Sigma_v^{-1} C$,

$$m_k = \Sigma \left( \Sigma_w^{-1}\, f(x_{k-1}) + C^T \Sigma_v^{-1}\, y_k \right) \qquad (1.241)$$

Then,

$$p(x_k \mid x_{k-1}, y_k) = N(m_k, \Sigma) \qquad (1.242)$$
This lucky Gaussian circumstance is not found in many applications. In general there
is no easy analytical way to get the posterior PDF, and you should try something
related to proposal importance functions.
Part of the research has proposed to use EKF, or UKF, to obtain before each
resampling step a proposal importance function. This idea has produced several
combinations of particle and Kalman filter versions. Some authors use the term
adaptation for that mechanism. However, the term adaptation is also used for certain
particle filter algorithms that change the number of particles over time, depending
on the range to be explored at each moment. A reason for adapting the number
of particles is to get fast particle filters, by using particle populations that are as small as
possible in each step.
In general, you should decide which type of proposal function would be adequate
for the process at hand, more or less fixed or adaptive.
Particles could also be used to estimate a good proposal before any resampling
step. It may be compared to a scouting party. The prediction effort could be addressed
to estimate the likelihood. In this context, it is illustrative to consider the Auxiliary
Particle Filter introduced in [84], which has been used in several reported applica-
tions with good results [113, 120].
Recall that the filtering target is to obtain the posterior in each step. Using Bayes
rule:
$$p(x_{k+1} \mid Y_{k+1}) \propto p(y_{k+1} \mid x_{k+1})\, p(x_{k+1} \mid Y_k) \qquad (1.244)$$
Observe that, according to the right-hand side of the last equation, the prior could
be regarded as a mixture density. Samples could be drawn as follows: select the i-th
component $p(x_{k+1} \mid x_k^i)$ with probability $W_k(x_k^i)$, and then take a sample from it.
Combining the previous two equations, one has:

$$p(x_{k+1} \mid Y_{k+1}) \propto p(y_{k+1} \mid x_{k+1}) \sum_{i=1}^{N} W_k(x_k^i)\, p(x_{k+1} \mid x_k^i) \qquad (1.246)$$

$$p(x_{k+1} \mid Y_{k+1}) \propto \sum_{i=1}^{N} W_k(x_k^i)\, p(y_{k+1} \mid x_{k+1})\, p(x_{k+1} \mid x_k^i) \qquad (1.247)$$
Again, this can be regarded as a mixture. Therefore, to sample from the posterior:
Another way to express this idea is by using an auxiliary variable, the index i,
and write:
One could draw, from each p(xk+1 , i | Yk+1 ) several samples, up to R. This could
be done using SIR, also obtaining R weights.
It has been suggested in [84] to use the following approximation for $p(x_{k+1}, i \mid Y_{k+1})$:

$$q(x_{k+1}, i \mid Y_{k+1}) \propto W_k(x_k^i)\, p(y_{k+1} \mid \mu_{k+1}^i)\, p(x_{k+1} \mid x_k^i) \qquad (1.249)$$
where $\mu_{k+1}^i$ is any likely value, the mean, the mode, etc., associated with the transition
PDF. Suitable approximations are of the form:
Then, samples would be drawn from $p(x_{k+1} \mid x_k^i)$ with probability $\lambda_i = q(i \mid Y_{k+1})$.
These $\lambda_i$ are called first-stage weights, and correspond to likelihoods. The purpose
of this sampling is to select particles with large predictive likelihoods.
Having obtained R samples $x_k^j$, a reweighting is done, using second-stage weights:

$$w_j = \frac{p(y_{k+1} \mid x_{k+1}^j)}{p(y_{k+1} \mid [\mu_{k+1}^i]^j)}\,, \quad j = 1, \ldots, R \qquad (1.251)$$
APF Algorithm:
When the process noise is small, the APF is usually better than SIR. For large noise,
$\mu_{k+1}^i$ would not give enough information about $p(x_{k+1} \mid x_k^i)$, and the APF could degrade.
Many particle filter variants have been and are being proposed. For evident reasons
of space, it is only possible to introduce, in a few words, some representative
examples.
In certain cases it may be possible to divide the process at hand into linear-Gaussian
and non-linear parts, so the state vector may be partitioned:
$$x_k = \begin{pmatrix} x_k^L \\ x_k^N \end{pmatrix} \qquad (1.252)$$

$$x_{k+1}^L = f_k^L(x_k^N) + A_k^L(x_k^N)\, x_k^L + G_k^L(x_k^N)\, w_k^L \qquad (1.255)$$
After initialization, the algorithm repeats the update and the prediction step:
(a) Update
Particle filter: obtain and normalize the importance weights corresponding to the
xk(i) particles, using the likelihood:
where:

$$y_k^{(i)} = h_k(x_k^{N(i)}) + C_k(x_k^{N(i)})\, x_k^{(i)} \qquad (1.258)$$

$$\Sigma_{y,k}^{(i)} = C_k(x_k^{N(i)})\, P_k^{(i)}\, C_k(x_k^{N(i)})^T + \Sigma_{v,k} \qquad (1.259)$$
$$x_k^{L(i)} = x_{a,k}^{L(i)} + K_k^{(i)} \left( y_k - h_k(x_k^{N(i)}) - C_k(x_k^{N(i)})\, x_{a,k}^{L(i)} \right) \qquad (1.262)$$
Resampling if required
(b) Prediction
Particle filter prediction using the first equation of the state space model (the term
with xkL is interpreted as process noise)
Kalman filter prediction for each particle using the second equation of the state
space model:
$$x_{a,k+1}^{L(i)} = f_k^L(x_k^{N(i)}) + A_k^L(x_k^{N(i)})\, x_k^{L(i)} \qquad (1.263)$$

$$M_{k+1}^{(i)} = A_k^L(x_k^{N(i)})\, P_k^{(i)}\, A_k^L(x_k^{N(i)})^T + \Sigma_w \qquad (1.264)$$

$$L_{k+1}^{(i)} = C_k(x_k^{N(i)})\, M_{k+1}^{(i)}\, C_k(x_k^{N(i)})^T + \Sigma_v \qquad (1.265)$$
Given a set of particles, PDFs could be approximated via the Dirac delta, like for
instance the posterior approximation mentioned in the SIS algorithm, which is repro-
duced below:
$$p(x_k \mid Y_k) \approx \sum_{i=1}^{N} W_k(x_k^i)\, \delta(x_k - x_k^i) \qquad (1.266)$$
Fig. 1.54 Kernels placed around the particles; their sum gives a continuous approximation of the PDF
$$p(x_k \mid Y_k) \approx \sum_{i=1}^{N} W_k(x_k^i)\, K(x_k, x_k^i) \qquad (1.267)$$
where the kernel K (xk , xik ) is symmetric, unimodal and smooth, with a limited band-
width. Good candidates for the kernel are Gaussian or Epanechnikov kernels. In
essence, the kernel is used for interpolation between particles. Figure 1.54 illustrates
the approach. Observe that the sum of kernels, which are placed around the particles,
is a continuous approximation of the PDF.
The width of the kernel is chosen to minimize the difference between the posterior
PDF and the kernel estimate. Once the continuous approximation is obtained, a new
generation of particles, with the desired variance, could be extracted from it.
The regularization could be applied at the prediction step, or at the updating step.
Some authors propose to use the posterior estimation to estimate its gradient and
move the particles, following the gradient direction, toward the modes of the posterior
[18].
Heuristic optimization is being applied to obtain better kernels, or to drive particles
toward the nearest maximum of the posterior [118].
Many of the new Bayesian filtering approaches are based on different ways of han-
dling the integrals that appear in the estimation methodology.
Then, it is convenient to look at the matter from the perspective of numerical
integration. It is good to see the place occupied by filters already described in this
chapter, and to open the floor for other filters that have not been mentioned yet.
Fig. 1.55 A view of Gaussian filters from the numerical integration perspective: product quadrature rules (Gauss-Hermite filter, GHF), non-product rules (cubature Kalman filter), divided difference filters (CDF1, CDF2), expectation propagation (EP), and other approximations
This section begins with a rapid overview of numerical integration methods. Then,
it turns to important examples of their use for Bayesian filters (Fig. 1.55).
Let us come back to the roots. Many aspects of filtering involve integrals. For
instance, a typical target is to obtain the mean and the variance of the posterior:
$$E(x) = \int x\, p(x \mid Y)\, dx \qquad (1.268)$$

$$E(x^2) = \int x^2\, p(x \mid Y)\, dx \qquad (1.269)$$
1.8.1 Geometry
1.8.2 Quadrature
There is a more general framework: the quadrature methods for numerical integra-
tion. The approximation just described, with evenly spaced knots, is a particular case
of quadrature. But in general, quadrature methods do not require the knots to be
equally spaced, and this can lead to more exact, and even computationally cheaper,
integration.
Actually, the mean value theorem says that:
b
f (x) d x = (b a) f (c), c [a, b] (1.272)
a
A quadrature rule is said to be exact to degree n if it yields exact equality when f(x) is any polynomial
of degree n or less.
Just for illustration purposes, let us include a simple example. One tries the following
approximation:

$$\int_{-1}^{1} f(x)\, dx \approx a\, f(x_1) + b\, f(x_2) \qquad (1.274)$$

Requiring exactness for monomials gives the conditions:

$$\int_{-1}^{1} 1\, dx = 2 = a + b$$

$$\int_{-1}^{1} x\, dx = 0 = a\, x_1 + b\, x_2$$

$$\int_{-1}^{1} x^2\, dx = \frac{2}{3} = a\, x_1^2 + b\, x_2^2$$

$$\int_{-1}^{1} x^3\, dx = 0 = a\, x_1^3 + b\, x_2^3$$

Solving, $a = b = 1$ and $x_2 = -x_1 = \sqrt{1/3}$. Therefore:

$$\int_{-1}^{1} f(x)\, dx \approx f(-\sqrt{1/3}) + f(\sqrt{1/3}) \qquad (1.275)$$
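This two-point rule can be checked numerically; NumPy's leggauss() returns exactly these knots and weights, and the rule integrates any cubic exactly (the snippet is ours, not from the book):

```python
import numpy as np

# Two-point Gauss-Legendre rule on [-1, 1]: the knots are
# +-1/sqrt(3) and the weights are (1, 1), as derived above.
x, w = np.polynomial.legendre.leggauss(2)

# Exact for polynomials up to degree 2n-1 = 3:
approx = np.sum(w * (x**3 + x**2))   # rule applied to f(x) = x^3 + x^2
exact = 2.0 / 3.0                    # int_{-1}^{1} (x^3 + x^2) dx
```

The odd term contributes nothing by symmetry, and the even term is captured exactly by the two knots.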
for all $j \neq k$.
All the zeros of $p_m(x)$ are real, with multiplicity 1, and lie in (a, b). Let us denote
these zeros as: $x_1, x_2, \ldots, x_m$.
Now, consider the following case:
$$\int_a^b f(x)\, w(x)\, dx \approx \sum_{i=1}^{n} w_i\, f(x_i) \qquad (1.277)$$

where the knots are the zeros of an orthogonal polynomial $p_n(x)$, and the weights
are given by:

$$w_i = \int_a^b l_i(x)\, w(x)\, dx \qquad (1.278)$$
with li (x) the i-th Lagrange interpolating polynomial for these knots.
The Lagrange interpolating polynomial of f(x) at these knots is:

$$p_n(x) = \sum_{i=1}^{n} f(x_i)\, l_i(x) \qquad (1.279)$$

and its weighted integral reproduces the quadrature sum:

$$\int_a^b p_n(x)\, w(x)\, dx = \int_a^b \sum_{i=1}^{n} f(x_i)\, l_i(x)\, w(x)\, dx = \sum_{i=1}^{n} f(x_i) \int_a^b l_i(x)\, w(x)\, dx = \sum_{i=1}^{n} f(x_i)\, w_i \qquad (1.280)$$
There are several sets of orthogonal polynomials, together with the corresponding
knots and weights for Gaussian quadrature, published in books and papers. The
orthogonal polynomials can be constructed with three term recurrence relations.
For the case of (a, b) = (-1, 1) and w(x) = 1 the orthogonal polynomials are the
Legendre polynomials.
Here is a table concerning the most common orthogonal polynomials:

  Interval (a, b)    Weight w(x)         Polynomials
  (-1, 1)            1                   Legendre
  (-1, 1)            1/sqrt(1 - x^2)     Chebyshev
  (0, inf)           exp(-x)             Laguerre
  (-inf, inf)        exp(-x^2)           Hermite
Finding the roots of polynomials may face numerical difficulties. For this reason,
conventional simple approaches are discouraged for quadrature purposes. Instead,
there are algorithms based on finding the eigenvalues of a specific matrix (called the
Jacobi matrix). This matrix is built based on the recurrence relation of the chosen
orthogonal polynomial. A QR iteration could be used to find the eigenvalues and
eigenvectors.
MATLAB provides a series of functions, quad(), quadl(), quadgk(), quadv(), etc.,
for numerical integration using different types of quadrature.
A simple idea is to try approximations of the integrand, with easy-to-integrate
components. For instance, logarithms could be applied to convert products into sums.
Of course, series expansions around a certain point could be a convenient
approach. In certain cases it would be opportune to decompose the integral into
several regions, applying Taylor series in each one.
Consider a certain Taylor expansion:

$$f(h) = f(0) + b_1 h + b_2 h^2 + b_3 h^3 + \ldots$$

where the coefficients $b_i$ are values of derivatives. The error of a first-order
approximation would be a certain O(h). Now, let us halve h:

$$f(h/2) = f(0) + \frac{1}{2} b_1 h + \frac{1}{4} b_2 h^2 + \frac{1}{8} b_3 h^3 + \ldots \qquad (1.283)$$

The error would be approximately O(h)/2. Then, combining the two expansions cancels the first-order term:

$$g(h) = 2 f(h/2) - f(h) = f(0) - \frac{1}{2} b_2 h^2 - \frac{3}{4} b_3 h^3 + \ldots \qquad (1.284)$$
Consider an integral $K = \int f(x)\, dx$, where f(x) has a peak at a point $x_0$. Take the logarithm of the integrand and apply
a Taylor expansion around the peak:

$$\ln f(x) \approx \ln f(x_0) - \frac{b}{2}(x - x_0)^2 + \ldots \qquad (1.286)$$

with:

$$b = -\left.\frac{\partial^2}{\partial x^2} \ln f(x)\right|_{x = x_0} \qquad (1.287)$$

Then, it is possible to approximate the integrand with a Gaussian:

$$f(x) = \exp(\ln f(x)) \approx f(x_0)\, \exp\!\left(-\frac{b}{2}(x - x_0)^2\right) \qquad (1.288)$$

And the value of the integral:

$$K \approx f(x_0)\, \sqrt{2\pi/b} \qquad (1.289)$$
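A small Python sketch of Laplace's method (our naming; b is obtained by a central difference on ln f). For a pure Gaussian integrand the approximation is exact:

```python
import numpy as np

def laplace_integral(logf, x0, h=1e-4):
    """Laplace's method: fit a Gaussian to f at its peak x0.
    b = -(d^2/dx^2) ln f at x0, via a central difference; the
    integral of f is then approximately f(x0)*sqrt(2*pi/b)."""
    b = -(logf(x0 + h) - 2 * logf(x0) + logf(x0 - h)) / h**2
    return np.exp(logf(x0)) * np.sqrt(2 * np.pi / b)

# Gaussian integrand exp(-(x-1)^2): the exact integral is sqrt(pi)
val = laplace_integral(lambda x: -(x - 1.0)**2, 1.0)
```

For non-Gaussian peaked integrands, such as the Student's t PDF of Fig. 1.56, the result is only an approximation.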
Fig. 1.56 Approximations of the Student's t PDF: (top) Laplace's method, (bottom) KLD minimization
$$\ln \int_a^b q(x)\, \frac{f(x)}{q(x)}\, dx \;\ge\; \int_a^b q(x)\, \ln \frac{f(x)}{q(x)}\, dx = B \qquad (1.292)$$

This lower bound is obviously related to the KLD from $q(x, \theta)$ to $f(x)$, which is:

$$D_{KL}(q \,\|\, f) = \int_a^b q(x, \theta)\, \ln \frac{q(x, \theta)}{f(x)}\, dx = -B \qquad (1.294)$$

The approximation target would be to minimize the KLD, by suitable changes in the
parameters $\theta$.
Another procedure for the variational approach is the following. An easy-to-integrate
lower bound $g(x, \theta) \le f(x)$ is found. Hence,

$$I \ge G(\theta) = \int_a^b g(x, \theta)\, dx \qquad (1.296)$$
Gaussian expressions are present in many filtering methods, and not only when
the noises and perturbations are Gaussian: Gaussian functions may also be used to
approximate unimodal distributions, and even multimodal distributions using
Gaussian sums. Therefore, Gaussian filters cover a large assortment of methods. The
aim of this subsection is to offer an organized view of Gaussian filters, from
the point of view of numerical integration, and to introduce some filters that have
not been mentioned yet.
One of the reasons for theorists' preference for Gaussian distributions is
that they are uniquely characterized by their mean and covariance.
When dealing with Bayesian filtering in Gaussian conditions, it is typical to deal
with multiple integrals like the following:
$$I = \frac{1}{\sqrt{(2\pi)^d \det \Sigma}} \int \cdots \int F(x)\, \exp\!\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right) dx \qquad (1.297)$$
A filter based on the application of Gauss-Hermite quadrature for this integral has
been proposed. The acronym of the filter is GHF (Gauss-Hermite filter). The main
idea is to decompose the multiple integral into a product of univariate integrals, and
apply Gauss-Hermite quadrature to each one. It is, then, a product rule. Weights
and knots are taken from standard tables.
Product rules suffer from the curse of dimensionality: as the space dimension
increases linearly, the number of knots increases exponentially (it is a
multidimensional grid).
There are alternatives to product rules that obtain similar precision with
fewer knots. For instance, lattice rules that transform the grid of knots according to
the integration domain; or sparse grids that concentrate most knots in suitable areas.
There are also rules that exploit the symmetries of the problem; let us focus on this
alternative.
In a fully symmetric set of cubature knots, equally weighted knots are symmetrically
distributed around the origin. The set [p] can be obtained from a generator knot
p by sign changes and coordinate permutations. For example, given p = (1, 0), the
set [p] = {(1, 0), (-1, 0), (0, 1), (0, -1)} can be generated.
Based on the invariant theory of Sobolev, it is possible to approximate the integral
with a third degree cubature rule:

$$I \approx w_0\, F(\mu) + w_1 \sum_{i=1}^{d} F(\mu + \sqrt{\Sigma}^T u\, e_i) + w_1 \sum_{i=1}^{d} F(\mu - \sqrt{\Sigma}^T u\, e_i) \qquad (1.301)$$

where the weights are as in (1.298), and $e_i$ has d components, with the i-th component
equal to 1 and the rest 0.
Notice that this approximation is the basis of the UKF filter.
More precision could be attained using a higher degree rule, with:

$$u = \sqrt{3}\,, \quad w_0 = 1 + \frac{d^2 - 7d}{18}\,, \quad w_1 = \frac{4 - d}{18}\,, \quad w_2 = \frac{1}{36} \qquad (1.303)$$
Based on a non-product cubature rule, the cubature Kalman filter (CKF) has been proposed. The CKF third-degree approximation is:

$$I \approx w_1 \sum_{i=1}^{d} F(\mu + \sqrt{\Sigma}^T \sqrt{d}\; e_i) + w_1 \sum_{i=1}^{d} F(\mu - \sqrt{\Sigma}^T \sqrt{d}\; e_i) \qquad (1.304)$$
$$x_a(n+1) = \frac{1}{2N} \sum_{i=1}^{2N} x_{ac}^{(i)}(n+1) \qquad (1.308)$$

$$M(n+1) = \Sigma_w - x_a(n+1)\, x_a(n+1)^T + \frac{1}{2N} \sum_{i=1}^{2N} x_{ac}^{(i)}(n+1)\, x_{ac}^{(i)}(n+1)^T \qquad (1.309)$$

(b) Update

$$y_a(n+1) = \frac{1}{2N} \sum_{i=1}^{2N} y_{ac}^{(i)}(n+1) \qquad (1.311)$$

$$S_{yy}(n+1) = \Sigma_v - y_a(n+1)\, y_a(n+1)^T + \frac{1}{2N} \sum_{i=1}^{2N} y_{ac}^{(i)}(n+1)\, y_{ac}^{(i)}(n+1)^T \qquad (1.312)$$

$$S_{xy}(n+1) = -x_a(n+1)\, y_a(n+1)^T + \frac{1}{2N} \sum_{i=1}^{2N} x_{ac}^{(i)}(n+1)\, y_{ac}^{(i)}(n+1)^T \qquad (1.313)$$

$$K(n+1) = S_{xy}(n+1)\, S_{yy}(n+1)^{-1} \qquad (1.314)$$

$$x_e(n+1) = x_a(n+1) + K(n+1)\,[\,y(n+1) - y_a(n+1)\,] \qquad (1.316)$$

Go to (a)
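The third-degree cubature rule is easy to check numerically. The sketch below (our naming; it assumes the standard 2d points $\mu \pm \sqrt{d}\, S\, e_i$ with equal weights 1/(2d), where S is a Cholesky factor of $\Sigma$) reproduces the mean and quadratic moments of a Gaussian exactly:

```python
import numpy as np

def cubature_expectation(F, mu, Sigma):
    """Third-degree cubature approximation of E[F(x)] for
    x ~ N(mu, Sigma): evaluate F at the 2d points mu +- sqrt(d)*S*e_i
    (S a Cholesky factor of Sigma), each with weight 1/(2d)."""
    d = len(mu)
    S = np.linalg.cholesky(Sigma)
    vals = []
    for i in range(d):
        off = np.sqrt(d) * S[:, i]   # scaled i-th column of S
        vals.append(F(mu + off))
        vals.append(F(mu - off))
    return np.mean(vals, axis=0)     # equal weights 1/(2d)
```

For instance, with F(x) = xᵀx and x ~ N(0, I), the rule returns the trace of the covariance, which is the exact value.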
1.8.4.3 Comparison
Many research contributions are devoted to compare different filters [26], but it is
not easy to get a clear view. The next comments are just a general prospect based on
[123].
In terms of precision one could take the integration of monomials as reference. For
instance, UKF3 tends to be better than the degree-2 GHF (let us denote it as GHF(2))
because UKF3 is exact for monomials up to degree 4 (with $u = \sqrt{3}$). According
to this precision criterion, the filters could be ordered as follows:

EKF < DDF1 < GHF(2) < UKF3 < DDF2 or CDF1 < CDF2 < GHF(3), UKF5
There is some controversy about computational costs, especially when comparing the
EKF (which has to compute the Jacobian) and UKF3. The cost of DDF2 or CDF1
is similar to that of UKF3. UKF5 has a higher cost. GHF has the highest cost, while DDF1
has the lowest.
In terms of numerical stability, it is desirable that all weights be positive and all
knots lie inside the integration limits. UKF3 would have a negative weight when
d > 3. GHF is the only one that is completely stable. Because of instability, it may
happen that the covariance becomes nonpositive in a filtering step, which means the
collapse of the algorithm. Some measures could be taken to prevent this problem.
Suppose that the integrand can be factorized as a product of terms:

  f(x) = Π_{i=1}^{n} t_i(x)   (1.317)
The idea is to approximate the integrand, term by term, with a function q(x). The function
q(x) has an assumed PDF with some parameters to adjust. You start with the first
term and adjust q(x) to fit this term using a divergence measure D (for instance
the KLD). Denote this first approximation q1(x). Now incorporate the second term
and get a(x) = q1(x) t2(x). Adjust q(x) to fit a(x) using D, and obtain a new q2(x).
Next, get a(x) = q2(x) t3(x), adjust q(x) to fit a(x) using D, and obtain q3(x). And so
on, until all terms have been incorporated. The final qn(x) is the approximation to the
integrand.
In general the fitting steps are solved by moment matching.
There is a problem with this procedure: the result depends on the order in which the
terms are incorporated.
A solution for this problem is expectation propagation (EP). The idea is to approx-
imate the factorized integrand by a factorized q(x):

  q(x) = Π_{i=1}^{n} t̃_i(x)   (1.318)

Each approximate factor t̃_j(x) is refined in turn, so that t̃_j(x) Π_{i≠j} t̃_i(x) is close to:

  t_j(x) Π_{i≠j} t̃_i(x)   (1.320)
1.9 Other Bayesian Filters

The number of Bayesian filter variants and new proposals is large and continuously
increasing. This section intends to show the trends in this area, and to give the
distinctive ideas of algorithms that have found application success (Fig. 1.57).
[Fig. 1.57 (diagram): a map of Bayesian filter variants: divide & conquer approaches (distributed, parallelized); the ensemble KF (EnKF); the iterated KF (IKF); the Gaussian sum PF (GSPF); combinations with Gaussian processes (GP-EKF, GP-UKF, GP-PF); special forms, such as square root filters (SR-UKF, SR-CDKF); and others: IMM-EKF, IMM-UKF, IMM-GHF, MM-UPF, Rao-Blackwellized KF, ensemble UKF (EnUKF), polar KF, spherical KF, quaternion KF, grid-based KF.]
Some key references are given for each identified activity nucleus, so the reader
is invited to explore the specific topics from these starting points.
Like particle filters, the ensemble Kalman filter (EnKF) uses a random set of states: an
ensemble. The algorithm follows a cycle of prediction and update steps, on the basis of
a state space model, which usually has nonlinearities. The distinctive aspects are that
the error covariance matrix is computed from the ensemble (empirical estimation),
and that a common Kalman gain is used to update each ensemble member. See [62]
and references therein for more details.
This type of filter is widely used in numerical weather prediction (NWP), where
the dimension of the state space may be 10^7. The use of the EKF would involve 10^7 ×
10^7 matrices, which is clearly excessive. Instead, the EnKF requires just some
hundreds of particles.
Other typical applications of EnKF are related to monitoring of ocean states and
currents, ecology, geophysics, etc.
The mentioned application contexts have a different perspective, and use a different
terminology. The filter is regarded as a data assimilation tool. Data assimilation
obtains agreements by combining model predictions and observations. The prediction
step is called the forecast step; the update step is called the analysis step. The state
space model is usually written as follows:
  x_k = M_k(x_{k−1}) + u_k   (1.321)

  y_k = H_k(x_k) + v_k   (1.322)
  x_k^{b(i)} = M_k(x_{k−1}^{a(i)}) + u_k^{(i)},  i = 1, …, N   (1.323)

Now, it is possible to compute the mean and the pertinent covariance matrices:

  ȳ_k^b = (1/N) Σ_{i=1}^{N} y_k^{b(i)}   (1.325)

  P_{xy}^k = (1/(N − 1)) Σ_{i=1}^{N} (x_k^{b(i)} − x̄_k^b) (y_k^{b(i)} − ȳ_k^b)^T   (1.326)

  P_{yy}^k = (1/(N − 1)) Σ_{i=1}^{N} (y_k^{b(i)} − ȳ_k^b) (y_k^{b(i)} − ȳ_k^b)^T   (1.327)

  K_k = P_{xy}^k (P_{yy}^k + Σv)^{−1}   (1.328)

  x_k^{a(i)} = x_k^{b(i)} + K_k (y_k^{s(i)} − H_k(x_k^{b(i)})),  i = 1, …, N   (1.329)

where the y_k^{s(i)} are surrogate observations, obtained by sampling from a normal distribution with mean ȳ_k^b and covariance Σv.
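A minimal Python sketch of the analysis step follows (the book's programs are in MATLAB; the toy model, the ensemble size, and the choice of centering the surrogate observations on the actual measurement y, as is common in perturbed-observation EnKF implementations, are assumptions):

```python
import numpy as np

rng = np.random.default_rng(42)

def enkf_analysis(Xb, Hk, y, Rv):
    """EnKF analysis (update) step with perturbed observations, following
    the pattern of (1.325)-(1.329): empirical covariances computed from the
    forecast ensemble Xb (columns are members), common gain K."""
    d, N = Xb.shape
    Yb = np.array([Hk(Xb[:, i]) for i in range(N)]).T   # forecast observations
    xb, yb = Xb.mean(axis=1), Yb.mean(axis=1)
    Xc, Yc = Xb - xb[:, None], Yb - yb[:, None]
    Pxy = Xc @ Yc.T / (N - 1)
    Pyy = Yc @ Yc.T / (N - 1)
    K = Pxy @ np.linalg.inv(Pyy + Rv)
    # surrogate observations: measurement perturbed with noise of covariance Rv
    Ys = y[:, None] + np.linalg.cholesky(Rv) @ rng.standard_normal((y.size, N))
    return Xb + K @ (Ys - Yb)

# toy check: observing the first state component pulls the ensemble toward y
Xb = rng.standard_normal((2, 500)) + np.array([[5.0], [0.0]])
Xa = enkf_analysis(Xb, lambda x: x[:1], np.array([0.0]), 0.01 * np.eye(1))
print(Xa.mean(axis=1))
```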
The iterated Kalman filter (IKF) repeats the measurement update, relinearizing at each
new estimate. Start from the standard update:

  xe(n + 1) = xa(n + 1) + K(n + 1) [y(n + 1) − h(xa(n + 1), 0)]

  i = 0;  x_0 = xa(n + 1)   (1.330)

Then iterate:
1. increase i
2. compute:
The convergence criterion could be that the difference (x_i − x_{i−1}) is sufficiently
small (a certain threshold is used).
Notice that H_i(n) is evaluated at x_i.
Once the iterations have converged, for a certain i = m, the estimation is
updated: xe(n + 1) = x_m.
There are nonlinear cases where EKF fails, but an adequately designed IKF can
be used. See for example [106].
It has been shown in [10] that the iterated update of IKF is an instance of the
Gauss-Newton optimization method.
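A scalar Python sketch of this iteration follows (illustrative only; the measurement h(x) = x², the Gauss-Newton form of the iterate, and all numeric values are assumptions). With a small measurement noise the iteration behaves like Newton's method on h(x) = y:

```python
import numpy as np

def iekf_update(xa, M, h, H_jac, y, R, max_iter=20, tol=1e-9):
    """Iterated EKF measurement update: relinearize h at the current
    iterate x_i until (x_i - x_{i-1}) falls below a threshold."""
    x = xa.copy()
    for _ in range(max_iter):
        H = H_jac(x)                                  # Jacobian evaluated at x_i
        K = M @ H.T @ np.linalg.inv(H @ M @ H.T + R)
        x_new = xa + K @ (y - h(x) - H @ (xa - x))    # Gauss-Newton style iterate
        if np.linalg.norm(x_new - x) < tol:
            x = x_new
            break
        x = x_new
    P = (np.eye(xa.size) - K @ H) @ M
    return x, P

# scalar range-like measurement h(x) = x^2, measurement y = 4
h = lambda x: np.array([x[0] ** 2])
H_jac = lambda x: np.array([[2 * x[0]]])
xe, P = iekf_update(np.array([1.5]), np.eye(1), h, H_jac,
                    np.array([4.0]), 1e-4 * np.eye(1))
print(xe)   # converges near x = 2, where h(x) = 4
```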
There are particle filter applications where the approximation of the posterior by a
single Gaussian PDF is adequate.
Two versions of the Gaussian particle filter (GPF) were introduced in [59]. Resampling
is not required. The noise can be non-Gaussian and non-additive. The filtering
algorithm is parallelizable. Two standard examples were shown, with GPF results
compared to those of the EKF, UKF, and SIS (particle filter). GPF outperforms EKF and
UKF. On the other hand, SIS is slightly better than GPF, but it requires much more
computation.
Specific filtering circumstances that require large computational effort and/or short
time response have been tackled by using several processors. Distributed architectures
have been proposed. Parallelization techniques have been introduced. The
filtering algorithms have been modified in order to share the processing work.
Also, the characteristics of some applications suggest the use of algorithms with
some kind of internal partition. This could be the use of several models pertaining to
the dynamics and/or the perturbations. How to manage several models gives rise to a
number of alternatives.
In this context, the word mixture has been used with several meanings. One
has already been mentioned, when describing the Rao-Blackwellized particle filter.
Another meaning is connected with sums of Gaussians to approximate PDFs. In
addition, the use of several filters could be called a mixture.
As a way to organize this subsection, it has been divided into two parts. The first is
devoted to Kalman filters and has relatively old roots. The second looks at particle
filters, where a lot of exploration activity exists.
It is typical in flight control systems to use a gain scheduling approach: the control
constants are switched according to the flight state. This is because the aircraft
dynamics are not the same when taking off, cruising, or landing. It is also natural
to use several models, and several Kalman filters. Multi-model Kalman filters have
been proposed for a long time. In cases like flight control, it is also
natural to switch from one filter to another (switching Kalman filter) [77], and one
of the problems is the switching mechanism, which must be bump-free.
In the switching Kalman filters a set of models is used sequentially, one after
another. There are other methods that use several models (or filters) in parallel. In this
case a mechanism is included in order to extract the output information, perhaps by
selecting the most successful filter at each moment, or by some kind of information
fusion. In the case of multiple target tracking it is natural to use several filters in
parallel, and the output information concerns all targets. For systems with a set of
interacting components, it has been proposed to use multiple interacting models [12].
In particular the so-called Gaussian sum filters [3] can be regarded as banks of
Kalman filters in parallel. The Gaussian sum is used to approximate the PDF of
interest: the transition PDF or the posterior PDF. Each Gaussian component of the
sum has an associated Kalman filter.
More generally, given a bank of Kalman filters, which could be structured as a
tree, a management unit could be added to adapt, by pruning or merging, the size
and structure of the bank to the current signal processing needs.
Gaussian sums are universal approximators for PDFs. The idea of using Gaussian
sums in particle filters has already been introduced [60], in the form of banks of particle filters.
Actually, three types of Gaussian sum particle filter (GSPF) are described in [60].
118 1 Kalman Filter, Particle Filter and Other Bayesian Filters
1.9.5 Combinations
The prediction capability of Gaussian processes (GP) could be used to predict the
next state and the observations in the context of a Kalman filter. Some authors [58]
express this by saying that GPs can be applied directly to learn the transition and observation
models, in such a way that GPs can be integrated into Kalman filters or particle filters.
Indeed, the literature shows examples of Gaussian process extended Kalman
filters, Gaussian process UKFs, Gaussian process particle filters, etc.
The scientific literature offers a rich variety of combinations of the described filtering
methods. Some of them are briefly cited below.
A Rao-Blackwellised unscented Kalman filter is introduced in [14]; variance
reduction and lower computational cost are achieved.
A modification of the ensemble Kalman filter based on the unscented transform
is introduced in [68] and denoted as EnUKF.
The unscented particle filter was described in [109] (it was previously introduced
in a Cambridge University technical report). The idea was to use a bank of unscented
filters to obtain the importance proposal distribution.
In [98] the auxiliary extended and the auxiliary unscented Kalman particle filters
were introduced. This publication includes extensive comparisons with other filters.
A set of combinations of interacting multiple model (IMM) and other methods
is studied in [25]. In particular this article compares IMM-EKF with IMM-UKF,
IMM-GHF, and multiple model unscented particle filter (MM-UPF).
In [65] an iterated extended Kalman particle filter was introduced.
In the context of particle filters, it was proposed in [111] to use a bank of sigma-
point Kalman filters for the proposal distribution, and a Gaussian mixture for the
posterior, so there is no resampling.
Instead of calculating the matrix square root of the covariance at each filtering iteration,
square-root forms of the UKF and the Central Difference KF directly propagate and
update the square root of the covariance using Cholesky factors. The acronyms
are SR-UKF and SR-CDKF. They are better in terms of numerical stability and
computational cost [110].
Polar coordinates are better suited than other coordinates for certain applications
(for instance, target tracking). There have been several efforts to formulate Kalman
filtering in polar [2, 116] or spherical coordinates [71].
When measuring 3D angular magnitudes, using inertial units, accelerometers,
magnetometers, etc., it is convenient to use quaternions in order to avoid singularities.
There are formulations of the Kalman filter using quaternions, as in [90] and
references therein.
In mobile robotics and other applications, spatial grids are used. Some adaptations
of the filtering algorithms have been proposed to include grids, like in [43] and
references therein.
In [78] the use of the Laplace method for the approximation of integrals in particle
filters has been proposed.
It has been proposed [28] to use block sampling strategies instead of one-at-a-time
sampling. This results in efficiency improvement.
1.10 Smoothing
Based on the dynamic model of the system, it is possible to predict future states. This
prediction will be as good as the process noise permits. The basic mathematical
aspects of prediction can be expressed using the transition matrix, which is
introduced next.
The transition matrix Φ(k) of the (noiseless) system x(n + 1) = A x(n) is such
that:

  Φ(j + 1) = A Φ(j),  j = 0, 1, 2, …   (1.333)

with Φ(0) = I.
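For a time-invariant A, the recursion simply yields Φ(j) = A^j, as a small sketch confirms (the matrix A is an arbitrary example):

```python
import numpy as np

# for x(n+1) = A x(n), the transition matrix obeys
# Phi(j+1) = A Phi(j), Phi(0) = I, i.e. Phi(j) = A^j
A = np.array([[1.0, 0.1], [0.0, 1.0]])

def transition_matrix(A, j):
    Phi = np.eye(A.shape[0])
    for _ in range(j):
        Phi = A @ Phi
    return Phi

print(transition_matrix(A, 3))   # equals A @ A @ A
```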
Now, based on measurements and model, the covariance of the best prediction of
the state, for i < n, is the following:

  P(n|i) = Φ(n, i) P(i|i) Φ(n, i)^T + Σ_{j=i+1}^{n} Φ(n, j) Σw Φ(n, j)^T   (1.336)

In this last expression one can observe the effect of the process noise.
As a first step into smoothing theory, the simplest case is now considered: to obtain
a smoother estimation using the next measurement. The following mathematical
development is based on the innovations process.
The innovations could be written as follows:
For brevity, let us denote the series of obtained measurements as:
Since x(n) and the innovations ε(n + 1) are jointly Gaussian with zero mean, the
following partition could be considered:

  [ x(n) ; ε(n + 1) ]   (1.340)

and then:

  x_S(n|n + 1) = x̂(n) + cov(x(n), ε(n + 1)) Σ_in(n + 1)^{−1} ε(n + 1)   (1.341)
The a priori estimation error evolves as:

  e_pr(n + 1) = A e_er(n) + w(n)   (1.343)

Then:

  cov(x(n), ε(n + 1)) = cov(x(n), e_er(n)) A^T C^T = cov(e_er(n), e_er(n)) A^T C^T = Σ(n) A^T C^T   (1.344)
In consequence:

  x_S(n|n + 1) = x̂(n) + F(n|n + 1) ε(n + 1)   (1.345)

with:

  F(n|n + 1) = Σ(n) A^T C^T Σ_in(n + 1)^{−1}   (1.346)
Let us show that the optimal smoothing includes the Kalman gain K(n + 1).
From the Kalman filter, one has that:

  C^T Σ_in(n + 1)^{−1} = M(n + 1)^{−1} K(n + 1)   (1.347)

Therefore:

  F(n|n + 1) = Σ(n) A^T M(n + 1)^{−1} K(n + 1) = G(n) K(n + 1)   (1.348)
where x(n) and x(n + 1) should be computed with the Kalman filter.
122 1 Kalman Filter, Particle Filter and Other Bayesian Filters
There are three main types of smoothers [72], which could be briefly introduced as
follows:
Fixed-interval
It is desired to obtain:

  x_S(n|N),  n = 0, 1, …, N − 1   (1.350)

The data processing is done after all data are gathered. It is an off-line task. For
each n within 0..N − 1, one wishes to obtain the optimal estimate of x(n) using
all the available measurements y(n), n = 0, 1, …, N.
Fixed-point
It is desired to obtain:

  x_S(n|j),  j = n + 1, n + 2, …   (1.351)

Here n is a fixed time instant of interest.
Fixed-lag
It is desired to obtain:

  x_S(n|n + N),  n = 0, 1, 2, …

with a fixed lag N.
The algorithms for these types of smoothers can be deduced from the one-stage
smoothing, or by augmentation of the state vector.
Let us continue with the expressions obtained for one-stage smoothing [8]. The three
types of smoothers can be implemented as follows:
Fixed-interval:
Some authors call this algorithm the two-pass smoother, since there is a
forward pass with the Kalman filter and a backward pass using the expression
above. The algorithm is also known as the RTS (Rauch-Tung-Striebel) smoother
[87].
Mendel proposed in [72] a version of the two-pass smoother that has better
computational characteristics. A residual state vector is defined as follows:
  j = N − 1, N − 2, …, 1   (1.357)

with: A_p(j) = A [I − K(j) C]
Another fixed-interval algorithm, less convenient for computation, is the forward-
backward filter proposed by Fraser and Potter [36]. Suppose you want to get a
smoothed value at m: x_S(m). The algorithm runs the Kalman filter up to m to
obtain a forward estimate, and runs a reverse system down to m to obtain a backward
estimate. Both estimates are combined to yield the smoothed value.
Fixed-point:
where: B( j) = B( j 1) G( j 1).
Fixed-lag:

  x_S(n + 1|n + 1 + N) = A x_S(n|n + N) +
    + B(n + 1 + N) K(n + 1 + N) ε(n + 1 + N|n + N) +   (1.360)
    + U(n + 1) [x_S(n|n + N) − x̂(n)]
[Figure: state estimates along 40 sampling periods.]
with:

  B(n + 1 + N) = Π_{i=n+1}^{n+N} G(i)   (1.361)

  U(n + 1) = Σw A^T Σ(n)^{−1}   (1.362)
%process noise
Sw=[12e-4 0; 0 6e-4]; %cov
sn=zeros(2,Nf);
sn(1,:)=sqrt(Sw(1,1))*randn(1,Nf);
sn(2,:)=sqrt(Sw(2,2))*randn(1,Nf);
%observation noise
Sv=[6e-4 0; 0 15e-4]; %cov.
on=zeros(2,Nf);
on(1,:)=sqrt(Sv(1,1))*randn(1,Nf);
on(2,:)=sqrt(Sv(2,2))*randn(1,Nf);
% system simulation preparation
x=[1;0]; % state vector with initial tank levels
u=0.4; %constant input
% Kalman filter simulation preparation
%space for matrices
K=zeros(2,2); M=zeros(2,2); P=zeros(2,2);
xe=[0.5; 0.2]; % filter state vector with initial values
%space for recording xa(n), xe(n)
rxa=zeros(2,Nf-1);rxe=zeros(2,Nf-1);
%behaviour of the system and the Kalman filter
% after initial state
% with constant input u
for nn=1:Nf-1,
%system simulation
xn=(A*x)+(B*u)+sn(:,nn); %next system state
x=xn; %system state actualization
ym=(C*x)+on(:,nn); %output measurement
%Prediction
xa=(A*xe)+(B*u); %a priori state
M=(A*P*A')+ Sw;
%Update
K=(M*C')*inv((C*M*C')+Sv);
P=M-(K*C*M);
xe=xa+(K*(ym-(C*xa))); %estimated (a posteriori) state
%recording xa(n), xe(n)
rxa(:,nn)=xa;
rxe(:,nn)=xe;
end;
%Smoothing-----------------------------
xs=zeros(2,Nf);
xs(:,Nf)=xe; %final estimated state
for nn=(Nf-1):-1:1,
G=(P*A')*inv(M);
xs(:,nn)=rxe(:,nn)+(G*(xs(:,nn+1)-rxa(:,nn)));
end;
%------------------------------------------
% display of state evolution
figure(3)
plot(xs(1,:),'r'); %plots xs1
hold on;
There are some smoothing algorithms that are based on augmentation of the state
vector. Here are two notable examples, as described in [40] and detailed in [96].
Fixed-lag:
Let us augment the state vector with delayed versions of the state (semicolons
separate block rows):

  x_A(n + 1) = [x(n + 1); x1(n + 1); x2(n + 1); … ; xl(n + 1)] =

    = [A 0 … 0; I 0 … 0; 0 I … 0; … ; 0 … I 0] · [x(n); x1(n); x2(n); … ; xl(n)] + [I; 0; 0; … ; 0] · w   (1.363)

  y(n) = [C 0 0 … 0] · [x(n); x1(n); x2(n); … ; xl(n)] + v   (1.364)
where:

  Σ^j(n + 1) = Σ^{j−1}(n) [A − K(n) C]^T   (1.368)

and:

  Σ^0(n) = Σ(n)   (1.370)
As the lag increases, the estimation error variance decreases. For l → ∞ the filter
approaches the non-causal Wiener filter. Nearly optimal performance is obtained for
l equal to two or three times the dominant time constant of the system.
In order to give an implementation example, a program has been developed. This
program is included in Appendix B. The first part of the program is very similar to
the previous program, since it includes the system simulation and a complete run of
the Kalman filter for state estimation. The second part of the program is the code for
fixed-lag smoothing. This part of the code is included below.
Figure 1.59 compares, as before, the smoothed state estimates (continuous curves)
against the state estimates given by the Kalman filter (cross marks). The lag has been
set to L = 5 samples. The continuous curves have been shifted left to compensate
for the delay.
% augmented K
aK=zeros(2*(L+1),2);
% set of covariances
Pj=zeros(2,2,L);
%space for recording xs(n)
rxs=zeros(2,Nf-1);
jed=(2*L)+1; %pointer for last entries
%jed=1;
%action:
axa(1:2,1)=rxe(:,1); %initial values
for nn=1:Nf,
M=(A*P*A')+Sw;
N=(C*P*C')+Sv;
ivN=inv(N);
K=(A*P*C')*ivN;
aK(1:2,:)=K;
aK(3:4,:)=(P*C')*ivN;
for jj=1:L,
bg=1+(jj*2); ed=bg+1;
aK(bg:ed,:)=(Pj(:,:,jj)*C')*ivN;
end;
aux=[A-K*C]';
Pj(:,:,1)=P*aux;
for jj=1:L-1,
Pj(:,:,jj+1)=Pj(:,:,jj)*aux;
end;
axp=(aA*axa)+bu+aK*(rym(:,nn)-C*axa(1:2,1));
P=M-(K*N*K');
rxs(:,nn)=axp(jed:jed+1);
axa=axp; %actualization (implies shifting)
end;
%------------------------------------------
% display of state evolution
figure(3)
plot(rxs(1,L:end),'r'); %plots xs1
hold on;
plot(rxs(2,L:end),'b'); %plots xs2
plot(rxe(1,:),'mx'); %plots xe1
plot(rxe(2,:),'kx'); %plots xe2
axis([0 Nf 0 1]);
xlabel('sampling periods');
title('Kalman filter states(x) and Smoothed states(-)');
[Fig. 1.59: Kalman filter states (x) and smoothed states (-) along 40 sampling periods.]
Fixed-point:
Notice that:

  x′(n) = x(n_f),  n ≥ n_f   (1.373)

As before, the augmented state vector and the Kalman filter can be used to obtain
the smoothed estimates:

  [x(n + 1); x′(n + 1)] = [A 0; 0 I] · [x(n); x′(n)] + [K(n); K′(n)] · [y(n) − C x(n)]   (1.374)

where:

  [K(n); K′(n)] = [A 0; 0 I] · [Σ11(n) Σ12(n); Σ21(n) Σ22(n)] · [C^T; 0] · [C Σ11(n) C^T + Σv]^{−1}   (1.375)

Denote:

  Σ_A(n) = [Σ11(n) Σ12(n); Σ21(n) Σ22(n)]   (1.376)

with initial value:

  Σ_A(n_f) = [Σ11(n_f) Σ11(n_f); Σ11(n_f) Σ11(n_f)]   (1.378)

with:

  Σ21(n_f) = Σ11(n_f)   (1.384)
% augmented A matrix
aA=diag(ones(4,1)); aA(1:2,1:2)=A;
% augmented K
aK=zeros(4,2);
%space for recording xs(Nfix), P21(n)
rxs=zeros(2,Nf);
rP21=zeros(2,2,Nf);
%action:
P11=rP(:,:,Nfix); P21=P11; %initial values
axa(1:2,1)=rxe(:,Nfix); %initial values
axa(3:4,1)=rxe(:,Nfix); %initial values
for nn=Nfix:Nf,
M=(A*P11*A')+Sw;
N=(C*P11*C')+Sv;
ivN=inv(N);
K=(A*P11*C')*ivN;
Ka=(P21*C')*ivN;
aK(1:2,:)=K; aK(3:4,:)=Ka;
axp=(aA*axa)+bu+aK*(rym(:,nn)-C*axa(1:2,1));
axa=axp; %actualization
rP21(:,:,nn)=P21; %recording
rxs(:,nn)=axp(3:4,1);
P11=M-(K*N*K');
P21=P21*(A-(K*C))';
end;
%------------------------------------------
% display of smoothed state at Nfix
figure(3)
plot(rxs(1,Nfix:end),'r'); %plots xs1
hold on;
plot(rxs(2,Nfix:end),'b'); %plots xs2
axis([0 Nf 0 0.6]);
xlabel('sampling periods');
title('State smoothing at Nfix');
% display of Covariance evolution
figure(4)
subplot(2,2,1)
plot(squeeze(rP21(1,1,Nfix:end)),'k');
title('Evolution of covariance');
subplot(2,2,2)
plot(squeeze(rP21(1,2,Nfix:end)),'k');
subplot(2,2,3)
plot(squeeze(rP21(2,1,Nfix:end)),'k');
subplot(2,2,4)
plot(squeeze(rP21(2,2,Nfix:end)),'k');
[Figure: smoothed state at Nfix along 40 sampling periods.]
[Figure: evolution of the four entries of the covariance P21 (orders of magnitude 10^−4 to 10^−5) along 40 sampling periods.]
Smoothing techniques are rapidly extending to more general contexts, beyond the
linear system with Gaussian noise.
One of the approaches for the application of Bayes to smoothing is given by the
following equation, due to Kitagawa [57]:
  p(x_n|Y_N) = ∫ p(x_n, x_{n+1}|Y_N) dx_{n+1} =
             = ∫ p(x_{n+1}|Y_N) p(x_n|x_{n+1}, Y_N) dx_{n+1} =
             = ∫ p(x_{n+1}|Y_N) p(x_n|x_{n+1}, Y_n) dx_{n+1} =   (1.385)
             = p(x_n|Y_n) ∫ [p(x_{n+1}|Y_N) p(x_{n+1}|x_n) / p(x_{n+1}|Y_n)] dx_{n+1}
With a forward recursion you can compute p(xn |Yn ); and with a backward recursion
p(xn |Y N ) from p(xn+1 |Y N ).
The same methods already introduced, like the EKF or UKF, are being applied for
smoothing in the presence of nonlinearities. Likewise, based on Bayesian expressions,
particle filters are being adapted for this scenario [13, 29]; one of the ways to
do so uses the following approximations:
  ∫ [p(x_{n+1}|Y_N) p(x_{n+1}|x_n) / p(x_{n+1}|Y_n)] dx_{n+1} ≈ Σ_i W_{n+1|N}(x^i_{n+1}) [p(x^i_{n+1}|x_n) / p(x^i_{n+1}|Y_n)] =

    = Σ_i W_{n+1|N}(x^i_{n+1}) [p(x^i_{n+1}|x_n) / Σ_j W_n(x^j_n) p(x^i_{n+1}|x^j_n)]   (1.386)

and:

  p(x_n|Y_n) ≈ Σ_k W_n(x^k_n) δ(x_n − x^k_n)   (1.387)

with:

  W_{n|N}(x^i_n) = W_n(x^i_n) Σ_k W_{n+1|N}(x^k_{n+1}) [p(x^k_{n+1}|x^i_n) / Σ_j W_n(x^j_n) p(x^k_{n+1}|x^j_n)]   (1.389)
The smoothing algorithm would have a first forward pass, running the particle filter
up to N and storing, for k = 1 : N, the sets {x^i_k, W_k(x^i_k)}. The second pass runs backwards
from N to n, recursively computing the importance weights.
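The backward pass can be sketched as follows (an illustrative Python sketch; the toy particles, weights, and Gaussian transition density are assumptions):

```python
import numpy as np

def backward_smoother_weights(particles, weights, trans_pdf):
    """Backward pass of the two-pass particle smoother: recompute the
    importance weights W_{n|N} from W_{n+1|N} with the reweighting
    formula (1.389); forward particles and weights are assumed stored."""
    T = len(particles)
    W = [w.copy() for w in weights]          # W[T-1] = forward weights at N
    for n in range(T - 2, -1, -1):
        xn, xn1 = particles[n], particles[n + 1]
        # P[k, i] = p(x^k_{n+1} | x^i_n) for all pairs (k, i)
        P = trans_pdf(xn1[:, None], xn[None, :])
        denom = P @ weights[n]               # sum_j W_n(x^j) p(x^k | x^j)
        W[n] = weights[n] * ((P / denom[:, None]).T @ W[n + 1])
        W[n] /= W[n].sum()                   # normalize
    return W

# toy example: Gaussian random-walk transition, two time steps
gauss = lambda a, b: np.exp(-0.5 * (a - b) ** 2) / np.sqrt(2 * np.pi)
parts = [np.array([0.0, 1.0]), np.array([0.2, 0.9])]
wts = [np.array([0.5, 0.5]), np.array([0.9, 0.1])]
W = backward_smoother_weights(parts, wts, gauss)
print(W[0])
```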
1.11 Applications of Bayesian Filters

This section intends to give an idea of the varied applications of Bayesian filters. Of
course, it is not a complete review. Anyway, the literature that has been selected for
this section offers a lot of references and hints to help the reader explore the area.
For this reason, most of the cited works are academic dissertations or papers with
large bibliographies.
In addition to the mentioned literature, there are books, like [48, 76], that include
a set of chapters with different Bayesian filter applications.
1.11.1 Navigation
1.11.2 Tracking
Like navigation, tracking is a typical application of Kalman filtering. One can easily
imagine scenarios related to radar and aircraft, airports, missiles, air traffic, ships
and maritime traffic, etc.
Most examples that usually appear in the scientific literature are related to navigation
or tracking. The use of angles involves trigonometric functions, and the use
of distances involves square roots. Hence, there are nonlinearities.
In addition to radar there are other tracking devices. Microphones or ultrasonic
transducers could be used for air or underwater applications. Devices based on
infrared, temperature, pressure, etc., could be used for certain types of detection
and tracking. Cameras could be employed for monitoring and security, face tracking,
pedestrian tracking, etc.
An extensive review of maneuvering target tracking is given in [63, 64]. Another review,
more related to people tracking, is [124]. An interesting article about Bayesian location
estimation is [35]. Particle filters are applied to positioning problems in [45].
Tracking is being extended to several targets at the same time. Proposed solutions
involve multiple models, banks of filters, partitioning of particles, etc.
Suppose you are driving a futuristic car with a lot of sensors, including artificial
vision. Ultrasonic devices detect collision risk within 0.01 s. The vision device needs
1 s of processing time to detect an object. The car moves at 15 m/s, which is a moderate
speed. A dog crosses at 10 m from the car. Should the on-board computer wait for
the vision?
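The arithmetic behind the question is immediate (a trivial sketch with the numbers of the scenario):

```python
# distance covered by the car while the vision system is still processing
speed = 15.0           # m/s
vision_latency = 1.0   # s
dog_distance = 10.0    # m

travel = speed * vision_latency
print(travel)                   # 15.0 m
print(travel > dog_distance)    # True: waiting for vision is not an option
```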
Another scenario: a ship in a harbour. The ship requires protection. Sonar
detectors show that something is moving underwater towards the ship. Underwater
microphones detect the sound of a propeller. It would be important to know whether this
sound comes from that thing. Also, it would be relevant to estimate whether the motion of
that thing could correspond to a machine or something else.
Fusion problems deal with information from several sensors, perhaps with different
delays. It is relevant to determine whether different types of signals come from the
same object. Notice that it is sometimes difficult to locate an airplane based on its
sound: instinctively, you look at the wrong place in the sky.
Lately, a typical fusion problem, concerning several types of vehicles, is to combine
inertial units and GPS to determine one's own position [24].
Most of the reported fusion applications use Bayesian estimation. The main idea
is to combine sensors and mathematical models.
1.11.4 SLAM
Where in the world am I at this moment? If you are on the street you look around,
recognize some familiar buildings, and this is enough to know your position. If you
were a tourist some years ago, a typical procedure was to use a map and read street
names. Nowadays GPS solves the problem when you don't know the place. However,
it is not possible to use GPS under certain circumstances.
Suppose your mission is to study the sea floor in a certain ocean zone. There is
no GPS underwater. What you can do is to recognize some possible landmarks on
the floor, and then determine your position with respect to the landmarks. Starting
from this, it would be possible to obtain a map of other new landmarks, expanding
the exploration field and being able to move knowing where you are on this map.
The SLAM problem can be stated as the computation of the posterior:

  p(x(n), m | Z_n, U_n, x_0)   (1.390)

Prediction:

  p(x(n), m | Z_{n−1}, U_n, x_0) =
    = ∫ p(x(n)|x(n − 1), u(n)) p(x(n − 1), m|Z_{n−1}, U_{n−1}, x_0) dx(n − 1)   (1.393)

Update:
Important contributions to solving the SLAM problem use the EKF. More recently the
FastSLAM algorithm was introduced, based on particle filters. Bayesian information
filters have also been proposed [114].
SLAM is a core topic of the so-called probabilistic robotics [105].
One of the most important signals for humans is speech. Research has contributed
algorithms for speech enhancement. Increased intelligibility, quality improvement,
noise rejection, etc., are objectives of such enhancement. The use of iterative
and sequential Kalman filters has been proposed in [38, 39]. A special modification
of the Kalman filter that considers properties of the human auditory system has been
investigated in [69]. A more recent contribution, using Bayesian inference, is [70].
Speech enhancement is used in mobile phones, hearing aids, speech recognition,
etc.
Another important area is vibration monitoring, which concerns machinery and
structures (for instance, bridges). The use of Bayesian filters, together with
signal classification methods, is an active research field. An example of this research
is [117].
1.11.6 Images
The Kalman filtering approach to image restoration has received a lot of attention.
Extended Kalman filters have been used for parameter estimation of the PSF. Also,
the degradation model has been posed as a Gauss-Markov model, so the Kalman filter
can be applied to estimate the image to be recovered [23, 79]. In recent years
the use of particle filters for image recovery has been considered [91].
Another application is the use of Kalman filtering for image stabilization, using
models of the camera motion [32].
Many forecasting applications are based on the use of the ensemble Kalman filter,
for example in aspects of ocean dynamics [33], or concerning weather prediction
[4].
Other papers about Bayesian filters and weather prediction include [37]; and about
ocean dynamics, [89].
Seismic signals are important for earthquake monitoring, the analysis of civil structures,
and the monitoring of oil reservoirs [54]. Bayesian recursive estimation for seismic
signals is investigated in [9].
Nowadays a lot of different electrical energy sources are being combined for daily
power supply: fossil or nuclear power plants, wind, solar panels, water, waves,
tides... Good management systems are needed to coordinate the sources, to obtain
alternating electrical power with correct voltage and frequency. Bayesian filtering is being
proposed to guarantee this behaviour [107, 122].
Other facets of electrical energy are consumption and price. This is treated in [47,
75] with Bayesian methods.
Bayesian filtering methods can be applied for biosignal monitoring and analysis [92,
102]. In [94] Bayesian filters are used for arrhythmia verification in ICU patients.
Automated health monitoring systems propose to monitor patients during their
activity at home. Bayesian methods for this application are investigated in [121].
A switching Kalman filter is proposed in [85] for neonatal intensive care.
1.11.10 Traffic
Vehicle traffic is an important part of modern life. Step by step, information technologies
are entering this scene through several doors. One of the first contributions
was related to traffic prediction, based on models [22]. This was used for
traffic light management. In certain countries it was also used to inform drivers about
the future status of parking facilities or road crowding.
Currently the target is to devise cars with no human driver. The transition to this
could be through intelligent highways and automatic convoys, perhaps with the help
of some kind of beacons.
Particle filters adapt quite naturally to traffic applications. See for instance
[30, 73].
In addition, Bayesian filters could help for robotized cars [5].
But it is not only a matter of cars. There are also ships and flying vehicles. Bayesian
methods are being applied to sea and air traffic [81]. The future will likely mix manned
and unmanned vehicles; it will be important to take into account, with adequate
models, the types of responses that could be expected from robots or from humans.
It has been proposed to train neural networks with nonlinear variants of the Kalman
filter. In [83] this is done, with good results, for price prediction using ARMA models.
1.12 Frequently Cited Examples

The purpose of this section is to collect typical examples used in the scientific literature.
Frequently these examples are employed to compare different filtering algorithms.
One of these typical examples, the body falling towards Earth, has already
been used in previous sections and will not be included here.
A classical case is bearings-only tracking, where the measured angle is:

  θ(n) = arctan(y(n)/x(n)) + v(n)   (1.397)
There are examples considering range-only tracking [88]. In this case the measurement
equation is similar to that of the falling body:

  R(n) = √(x(n)² + y(n)²)   (1.398)
Other examples combine angle and range tracking [45], so the process output is a
vector y = [R, θ].
Angle-only tracking is studied in [56].
When dealing with tracking scenarios it is important to consider what happens
with observability, [100].
The tracking of maneuvering targets, like an aircraft, could involve accelerations
and the use of 3D coordinates.
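To make the measurement models above concrete, here is a small Python sketch (the book's own programs are in MATLAB; Python is used here only for compactness) that evaluates the bearing measurement (1.397) and the range measurement (1.398) for a target position. The noise level `sigma_v` is an illustrative assumption, not a value from the text.

```python
import math
import random

def bearing_measurement(x, y, sigma_v=0.0):
    # theta(n) = arctan(y(n)/x(n)) + v(n); atan2 handles all four quadrants
    return math.atan2(y, x) + random.gauss(0.0, sigma_v)

def range_measurement(x, y, sigma_v=0.0):
    # R(n) = sqrt(x(n)^2 + y(n)^2), optionally with additive noise
    return math.hypot(x, y) + random.gauss(0.0, sigma_v)

# noiseless check: target at (3, 4)
print(bearing_measurement(3.0, 4.0))  # atan2(4,3) ≈ 0.9273
print(range_measurement(3.0, 4.0))    # 5.0
```

An extended Kalman filter for these models would linearize the two functions around the current state estimate.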
The following population growth model has become a benchmark for filtering algorithms [7, 16, 29, 41, 57, 59]:

x(n) = 0.5 x(n−1) + 25 x(n−1)/(1 + x(n−1)²) + 8 cos(1.2 (n−1)) + w(n−1)   (1.399)

y(n) = x(n)²/20 + v(n)   (1.400)
where:
w(n) ∼ N(0, 1) , v(n) ∼ N(0, 1)   (1.401)
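A minimal Python sketch of this benchmark (hypothetical helper names; the chapter's programs are in MATLAB) propagates (1.399)–(1.400). Passing explicit noise values makes the recursion easy to check deterministically.

```python
import math
import random

def growth_step(x_prev, n, w_prev=0.0):
    # x(n) = 0.5 x(n-1) + 25 x(n-1)/(1 + x(n-1)^2) + 8 cos(1.2 (n-1)) + w(n-1)
    return (0.5 * x_prev + 25.0 * x_prev / (1.0 + x_prev**2)
            + 8.0 * math.cos(1.2 * (n - 1)) + w_prev)

def measure(x, v=0.0):
    # y(n) = x(n)^2 / 20 + v(n)
    return x**2 / 20.0 + v

def simulate(x0, N, seed=0):
    # full trajectory with N(0,1) process and measurement noise, per (1.401)
    rng = random.Random(seed)
    xs, ys = [x0], []
    for n in range(1, N + 1):
        xs.append(growth_step(xs[-1], n, rng.gauss(0.0, 1.0)))
        ys.append(measure(xs[-1], rng.gauss(0.0, 1.0)))
    return xs, ys
```

With zero noise and x0 = 0.1, the first step gives x(1) = 0.05 + 25·0.1/1.01 + 8 ≈ 10.525, which is a handy sanity check when testing a particle filter against this model.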
This is a time series model for fluctuations in stock market prices, exchange rates and option pricing [52]:
where:
ε(n), v(n) ∼ N(0, 1)   (1.404)
This is a synthetic time series that is useful for algorithm checking [20, 66, 109]:
where the PDF of the noise s(n) could be of Gamma type, or Gaussian, etc. The values of the parameters could be: a = 0.5; ω = 0.04.
The measurement equation could be:

y(n) = b x(n)² + v(n) ,   n ≤ 30
y(n) = c x(n)² − 2 + v(n) ,   n > 30   (1.408)
The pendulum is a natural example of a nonlinear system. A state space model can be written as follows [97]:
This is an interesting example for state estimation of constrained systems [27, 93, 103].
1.13 Resources
1.13.1 MATLAB
1.13.1.1 Toolboxes
Dan Simon:
http://academic.csuohio.edu/simond/
Polypedal/kalman:
http://www.eecs.berkeley.edu/~sburden/matlab/
http://www.polypedal/doc/polypedal/kalman/index.html
Tim Bailey:
www-personal.acfr.usyd.edu.au/tbailey/
EnKF Ensemble Kalman Filter:
http://enkf.nersc.no/
Hannes Nickisch:
http://hannes.nickisch.org/
Bethge Lab:
http://bethgelab.org/
1.13.2 Internet
Kernel-Machines.org:
www.kernel-machines.org/
References
1. V.J. Aidala, Kalman filter behavior in bearings-only tracking applications. IEEE T. Aerosp. Electron. Syst. 15(1), 29–39 (1979)
2. V.J. Aidala, S.E. Hammel, Utilization of modified polar coordinates for bearings-only tracking. IEEE T. Autom. Control 28(3), 283–294 (1983)
3. D.L. Alspach, H.W. Sorenson, Nonlinear Bayesian estimation using Gaussian sum approximation. IEEE T. Autom. Control 17(4), 439–448 (1972)
4. J.L. Anderson, Ensemble Kalman filters for large geophysical applications. IEEE Control Syst. Mgz. 66–82 (2009)
5. N. Apostoloff, Vision based lane tracking using multiple cues and particle filtering. Masters thesis, Australian National Univ., 2005
6. M.S. Arulampalam, B. Ristic, N. Gordon, T. Mansell, Bearings-only tracking of manoeuvring targets using particle filters. EURASIP J. Appl. Signal Process. 2004(15), 2351–2365 (2004)
7. S. Arulampalam, S. Maskell, N. Gordon, T. Clapp, A tutorial on particle filters for on-line non-linear/non-Gaussian Bayesian tracking. IEEE T. Signal Process. 50(2), 174–188 (2002)
8. S.P. Banks, Control Systems Engineering (Prentice-Hall, 1986)
9. E. Baziw, Application of Bayesian recursive estimation for seismic signal processing. Ph.D. thesis, Univ. British Columbia, 2007
10. B.M. Bell, F.W. Cathey, The iterated Kalman filter update as a Gauss-Newton method. IEEE T. Autom. Control 38(2), 294–297 (1993)
11. J.L. Blanco, J. Gonzalez, J.A. Fernandez-Madrigal, Optimal filtering for non-parametric observation models: applications to localization and SLAM. Int. J. Robot. Res. 29(14), 1726–1742 (2010)
12. H.A.P. Blom, Y. Bar-Shalom, The interacting multiple model algorithm for systems with Markovian switching coefficients. IEEE T. Autom. Control 33(8), 780–783 (1988)
13. M. Briers, A. Doucet, S. Maskell, Smoothing algorithms for state-space models. Ann. Inst. Stat. Math. 62(1), 61–89 (2010)
14. M. Briers, S.R. Maskell, R. Wright, A Rao-Blackwellised unscented Kalman filter, in Proceedings 6th IEEE International Conference on Information Fusion, pp. 55–61 (2003)
15. J.V. Candy, Bayesian Signal Processing (Wiley, IEEE, 2009)
16. O. Cappé, S.J. Godsill, E. Moulines, An overview of existing methods and recent advances in sequential Monte Carlo. Proc. IEEE 95(5), 899–924 (2007)
17. R. Casarin, J.M. Marin, Online data processing: comparison of Bayesian regularized particle filters. Electron. J. Stat. 3, 239–258 (2009)
18. C. Chang, R. Ansari, Kernel particle filter for visual tracking. IEEE Signal Process. Lett. 12(3), 242–245 (2005)
19. Z. Chen, Bayesian filtering: from Kalman filters to particle filters, and beyond. Technical Report, McMaster Univ., 2003. http://www.dsi.unifi.it/users/chisci/idfric/Nonlinear_filtering_Chen.pdf
20. Q. Cheng, P. Bondon, An efficient two-stage sampling method in particle filter. IEEE T. Aerosp. Electron. Syst. 48(3), 2666–2672 (2012)
21. Y. Cheng, J.L. Crassidis, Particle filtering for sequential spacecraft attitude estimation, in Proceedings AIAA Guidance, Navigation, and Control Conference (2004)
22. R. Chrobok, Theory and application of advanced traffic forecast methods. Ph.D. thesis, Univ. Duisburg-Essen, Germany (2005)
23. S. Citrin, M.R. Azimi-Sadjadi, A full-plane block Kalman filter for image restoration. IEEE T. Image Process. 1(4), 488–495 (1992)
24. J.L. Crassidis, Sigma-point Kalman filtering for integrated GPS and inertial navigation. IEEE T. Aerosp. Electron. Syst. 42(2), 750–756 (2006)
25. N. Cui, L. Hong, J.R. Layne, A comparison of nonlinear filtering approaches with an application to ground target tracking. Signal Process. 85, 1469–1492 (2005)
26. F. Daum, Nonlinear filters: beyond the Kalman filter. IEEE A&E Syst. Mag. 20(8), 57–69 (2005)
27. M.P. Deisenroth, M.F. Huber, U.D. Hanebeck, Analytic moment-based Gaussian process filtering, in Proceedings 26th ACM Annual International Conference Machine Learning, pp. 225–232 (2009)
28. A. Doucet, M. Briers, Efficient block sampling strategies for sequential Monte Carlo methods. J. Comput. Graph. Stat. 15(3), 693–711 (2006)
29. A. Doucet, S. Godsill, C. Andrieu, On sequential Monte Carlo sampling methods for Bayesian filtering. Stat. Comput. 10, 197–208 (2000)
30. M.C. Dunn, Applying particle filter and path-stack methods to detecting anomalies in network traffic volume. Ph.D. thesis, Carnegie-Mellon Univ., 2004
31. H. Durrant-Whyte, T. Bailey, Simultaneous localisation and mapping (SLAM): Part I. IEEE Robot. Autom. Mgz. 99–108 (2006)
32. S. Erturk, Real-time digital image stabilization using Kalman filters. Real-Time Imag. 8, 317–328 (2002)
33. G. Evensen, The ensemble Kalman filter: theoretical formulation and practical implementation. Ocean Dyn. 53, 343–367 (2003)
34. D. Fox, Adapting the sample size in particle filters through KLD-sampling. Int. J. Robot. Res. 22(12), 985–1003 (2003)
35. D. Fox, J. Hightower, L. Liao, D. Schulz, G. Borriello, Bayesian filtering for location estimation. IEEE Pervasive Comput. 24–33 (2003)
36. D. Fraser, J. Potter, The optimum linear smoother as a combination of two optimum linear filters. IEEE T. Autom. Control 14(4), 387–390 (1969)
37. G. Galanis, P. Louka, P. Katsafados, G. Kallos, I. Pytharoulis, Applications of Kalman filters based on non-linear functions to numerical weather prediction. Ann. Geophys. 24, 1–10 (2006)
38. S. Gannot, Speech processing utilizing the Kalman filter. IEEE Instrum. Meas. Mgz. 10–14 (2012)
39. S. Gannot, D. Burshtein, E. Weinstein, Iterative and sequential Kalman filter-based speech enhancement algorithms. IEEE T. Speech Audio Process. 6(4), 373–385 (1998)
40. G.C. Goodwin, Adaptive Filtering Prediction and Control (Prentice-Hall, 1984)
41. N.J. Gordon, D.J. Salmond, A.F.M. Smith, Novel approach to nonlinear/non-Gaussian Bayesian state estimation. IEE Proc.-F 140(2), 107–113 (1993)
42. M.S. Grewal, A.P. Andrews, Applications of Kalman filtering in aerospace 1960 to the present. IEEE Control Syst. Mag. 69–78 (2010)
43. G. Grisetti, C. Stachniss, W. Burgard, Improved techniques for grid mapping with Rao-Blackwellized particle filters. IEEE T. Robot. 32(1), 34–46 (2007)
44. F. Gustafsson, Particle filter theory and practice with positioning applications. IEEE A&E Mag. 25(7), 53–81 (2010)
45. F. Gustafsson, F. Gunnarsson, N. Bergman, U. Forssell, J. Jansson, R. Karlsson, P.J. Nordlund, Particle filters for positioning, navigation, and tracking. IEEE T. Signal Process. 50(2), 425–437 (2002)
46. F. Gustafsson, G. Hendeby, Some relations between extended and unscented Kalman filters. IEEE T. Signal Process. 60(2), 545–555 (2012)
47. S. Hamis, Dynamics of oil and electricity spot prices in ensemble stochastic models. Masters thesis, Lappeenranta Univ. of Technology, 2012
48. A.C. Harvey, Forecasting, Structural Time Series Models and the Kalman Filter (Cambridge University Press, 1990)
49. A.M. Hasan, K. Samsudin, A.R. Ramli, R.S. Azmir, S.A. Ismael, A review of navigation systems (integration and algorithms). Australian J. Basic Appl. Sci. 3(2), 943–959 (2009)
50. J.D. Hol, T.B. Schon, F. Gustafsson, On resampling algorithms for particle filters, in Proceedings IEEE Nonlinear Statistical Signal Processing Workshop, pp. 79–82 (2006)
51. K. Ito, K.Q. Xiong, Gaussian filters for nonlinear filtering problems. IEEE T. Autom. Control 45(5), 910–927 (2000)
52. E. Jacquier, N.G. Polson, P.E. Rossi, Bayesian analysis of stochastic volatility models with fat-tails and correlated errors. J. Econ. 122(1), 185–212 (2004)
53. S.J. Julier, J.K. Uhlmann, Unscented filtering and nonlinear estimation. IEEE Proc. 92(3), 401–422 (2004)
54. S. Kalla, Reservoir characterization using seismic inversion data. Ph.D. thesis, Louisiana State Univ., 2008
55. R. Kalman, A new approach to linear filtering and prediction problems. ASME J. Basic Eng. 82, 35–45 (1960)
56. R. Karlsson, Various topics on angle-only tracking using particle filters. Technical Report, Linköping University, 2002. http://www.diva-portal.org/smash/get/diva2:316636/FULLTEXT01.pdf
57. G. Kitagawa, Non-Gaussian state-space modeling of nonstationary time-series. J. Am. Stat. Assoc. 82, 1032–1063 (1987)
58. J. Ko, D. Fox, GP-BayesFilters: Bayesian filtering using Gaussian process prediction and observation models. Auton. Robots 27, 75–90 (2009)
59. J.H. Kotecha, P.M. Djuric, Gaussian particle filtering. IEEE T. Signal Process. 51(10), 2592–2601 (2003)
60. J.H. Kotecha, P.M. Djuric, Gaussian sum particle filtering. IEEE T. Signal Process. 51(10), 2602–2612 (2003)
61. B.L. Kumari, K.P. Raju, V.Y.V. Chandan, R.S. Krishna, V.M.J. Rao, Application of extended Kalman filter for a free falling body towards Earth. Int. J. Adv. Comput. Sci. Appl. 2(4), 134–140 (2011)
62. S. Lakshmivarahan, D.J. Stensrud, Ensemble Kalman filter. IEEE Control Syst. Mag. 34–46 (2009)
63. X.R. Li, V.P. Jilkov, Survey of maneuvering target tracking. Part I: Dynamic models. IEEE T. Aerosp. Electron. Syst. 39(4), 1333–1364 (2003)
64. X.R. Li, V.P. Jilkov, Survey of maneuvering target tracking. Part II: Motion models and ballistic and space targets. IEEE T. Aerosp. Electron. Syst. 46(1), 96–119 (2004)
65. L. Liang-qun, J. Hong-bin, L. Jun-hui, The iterated extended Kalman particle filter, in Proceedings IEEE International Symposium on Communications and Information Technology, pp. 1213–1216 (2005)
66. L. Liang-Qun, J. Hong-Bing, L. Jun-Hui, The iterated extended Kalman particle filter, in Proceedings IEEE International Symposium Communications and Information Technology, ISCIT 2005, vol. 2, pp. 1213–1216 (2005)
67. F. Lindsten, Rao-Blackwellised particle methods for inference and identification. Masters thesis, Linköping University, 2011. Licentiate Thesis
68. X. Luo, I.M. Moroz, Ensemble Kalman filter with the unscented transform. Physica D: Nonlinear Phenomena 238(5), 549–562 (2009)
69. N. Ma, Speech enhancement algorithms using Kalman filtering and masking properties of human auditory systems. Ph.D. thesis, Univ. Ottawa, 2005
70. C. Maina, Approximate Bayesian inference for robust speech processing. Ph.D. thesis, Drexel Univ., 2011
71. M. Mallick, L. Mihaylova, S. Arulampalam, Y. Yan, Angle-only filtering in 3D using modified spherical and log spherical coordinates, in Proceedings of 14th IEEE International Conference on Information Fusion, pp. 1–8 (2001)
72. J.M. Mendel, Lessons in Digital Estimation Theory (Prentice-Hall, 1987)
73. L. Mihaylova, A. Hegyi, A. Gning, R. Boel, Parallelized particle and Gaussian sum particle filters for large scale freeway traffic systems. IEEE T. Intell. Transp. Syst. 1–13 (2012)
74. T.P. Minka, A family of algorithms for approximate Bayesian inference. Ph.D. thesis, MIT, 2001
75. A. Molina-Escobar, Filtering and parameter estimation for electricity markets. Ph.D. thesis, Univ. British Columbia, 2009
76. V.M. Moreno, A. Pigazo, Kalman Filter Recent Advances and Applications (Intech, 2009)
77. K.P. Murphy, Switching Kalman filters. Technical report, Univ. Berkeley, 1998
78. C. Musso, P.B. Quang, F. Le Gland, Introducing the Laplace approximation in particle filtering, in Proceedings of 14th International Conference on Information Fusion, pp. 290–297 (2001)
79. R. Nagayasu, N. Hosoda, N. Tanabe, H. Matsue, T. Furukawa, Restoration method for degraded images using two-dimensional block Kalman filter with colored driving source, in Proceedings IEEE Digital Signal Processing Workshop, pp. 151–156 (2011)
80. J.C. Neddermeyer, Sequential Monte Carlo methods for general state-space models. Masters thesis, University Ruprecht-Karls Heidelberg, 2006. Diploma dissertation
81. P-J. Nordlund, Efficient estimation and detection methods for airborne applications. Ph.D. thesis, Linköping University, 2008
82. M. Norgaard, N.K. Poulsen, O. Ravn, New developments in state estimation for nonlinear systems. Automatica 36, 1627–1638 (2000)
83. M.A. Oliveira, An application of neural networks trained with Kalman filter variants (EKF and UKF) to heteroscedastic time series forecasting. Appl. Math. Sci. 6(74), 3675–3686 (2012)
84. M.K. Pitt, N. Shephard, Filtering via simulation: auxiliary particle filters. J. Am. Stat. Assoc. 94(446), 590–599 (1999)
85. J.A. Quinn, Bayesian condition monitoring in neonatal intensive care. Ph.D. thesis, University of Edinburgh, 2007
86. K. Radhakrishnan, A. Unnikrishnan, K.G. Balakrishnan, Bearing only tracking of maneuvering targets using a single coordinated turn model. Int. J. Comput. Appl. 1(1), 25–33 (2010)
87. H.E. Rauch, F. Tung, C.T. Striebel, Maximum likelihood estimates of linear dynamic systems. J. Am. Inst. Aeronaut. Astronaut. 3(8), 1445–1450 (1965)
88. B. Ristic, S. Arulampalam, N. Gordon, Beyond the Kalman filter. IEEE Aerosp. Electron. Syst. Mgz. 19(7), 37–38 (2004)
89. M. Rixen et al., Improved ocean prediction skill and reduced uncertainty in the coastal region from multi-model super-ensembles. J. Marine Syst. 78, S282–S289 (2009)
90. A.M. Sabatini, Quaternion-based extended Kalman filter for determining orientation by inertial and magnetic sensing. IEEE T. Biomed. Eng. 53(7), 1346–1356 (2006)
91. S.I. Sadhar, A.N. Rajagopalan, Image recovery under nonlinear and non-Gaussian degradations. J. Opt. Soc. Am. 22(4), 604–615 (2005)
92. R. Sameni, M.B. Shamsollahi, C. Jutten, G.D. Clifford, A nonlinear Bayesian filtering framework for ECG denoising. IEEE T. Biomed. Eng. 54(12), 2172–2185 (2007)
93. S. Särkkä, Bayesian Filtering and Smoothing, vol. 3 (Cambridge University Press, 2013)
94. O. Sayadi, M.B. Shamsollahi, Life-threatening arrhythmia verification in ICU patients using the joint cardiovascular dynamical model and a Bayesian filter. IEEE T. Biomed. Eng. 58(10), 2748–2757 (2011)
95. B. Sherlock, B. Herbst, Introduction to the Kalman Filter and Applications (University of Stellenbosch, South Africa, 2002). http://dip.sun.ac.za/~hanno/tw796/lesings/kalman.pdf
96. D. Simon, Optimal State Estimation (Wiley, 2006)
97. D. Simon, Kalman filtering with state constraints: a survey of linear and nonlinear algorithms. IET Control Theory Appl. 4(8), 1303–1318 (2010)
98. L. Smith, V. Aitken, The auxiliary extended and auxiliary unscented Kalman particle filters, in Proceedings IEEE Canadian Conference on Electrical and Computer Engineering, pp. 1626–1630. Vancouver (2007)
99. A. Solin, Cubature integration methods in non-linear Kalman filtering and smoothing. Masters thesis, Aalto University School of Science and Technology, Finland, 2010. Bachelors thesis
100. T.L. Song, Observability of target tracking with range-only measurements. IEEE J. Oceanic Eng. 24(3), 383–387 (1999)
101. J.R. Stroud, N.G. Polson, P. Müller, Practical filtering for stochastic volatility models, ed. by S. Koopmans, A. Harvey, N. Shephard. State Space and Unobserved Component Models, pp. 236–247 (Cambridge University Press, 2004)
102. M. Tarvainen, Estimation methods for nonstationary biosignals. Ph.D. thesis, University of Kuopio, Finland, 2004
103. B.O. Teixeira, J. Chandrasekar, L.A. Tôrres, L.A. Aguirre, D.S. Bernstein, State estimation for linear and non-linear equality-constrained systems. Int. J. Control 82(5), 918–936 (2009)
104. G.A. Terejanu, Tutorial on Monte Carlo techniques. Technical report, University of Buffalo, 2008. Together with other tutorials. https://cse.sc.edu/~terejanu/files/tutorialMC.pdf
105. S. Thrun, W. Burgard, D. Fox, Probabilistic Robotics (MIT Press, 2005)
106. S. Tully, H. Moon, G. Kantor, H. Choset, Iterated filters for bearing-only SLAM, in Proceedings IEEE Intl. Conf. on Robotics and Automation, pp. 1442–1448 (2008)
107. G. Valverde, V. Terzija, Unscented Kalman filter for power system dynamic state estimation. IET Gener. Trans. Distrib. 5(1), 29–37 (2010)
108. R. Van der Merwe, Sigma-point Kalman Filters for probabilistic inference in dynamic state-space models. Ph.D. thesis, Oregon Health & Science University, 2004
109. R. Van der Merwe, N. Freitas, A. Doucet, E. Wan, The unscented particle filter. Adv. Neural Inf. Process. Syst. 13 (2001)
110. R. Van der Merwe, E. Wan, Efficient derivative-free Kalman filters for online learning, in Proceedings European Symposium on Artificial Neural Networks, pp. 205–210 (Bruges, 2001)
111. R. Van der Merwe, E. Wan, Gaussian mixture sigma-point particle filters for sequential probabilistic inference in dynamic state-space models, in Proceedings IEEE International Conference Acoustics, Speech and Signal Processing, vol. 6, pp. 701–704 (2003)
112. M. Verhaegen, P. van Dooren, Numerical aspects of different Kalman filter implementations. IEEE T. Autom. Control 31(10), 907–917 (1986)
113. N. Vlassis, B. Terwijn, B. Krose, Auxiliary particle filter robot localization from high-dimensional sensor observations, in Proceedings IEEE International Conference Robotics and Automation, vol. 1, pp. 7–12 (2002)
114. M.R. Walter, Sparse Bayesian information filters for localization and mapping. Ph.D. thesis, MIT, 2008
115. E.A. Wan, R. Van der Merwe, The unscented Kalman filter, ed. by S. Haykin. Kalman Filtering and Neural Networks, Chap. 13, pp. 221–280 (Wiley, 2001)
116. D. Wang, H. Hua, H. Cao, Algorithm of modified polar coordinates UKF for bearings-only target tracking, in Proceedings IEEE International Conference Future Computer and Communication, vol. 3, pp. 557–560 (Wuhan, 2010)
117. K. Wang, Vibration monitoring on electrical machines using Vold-Kalman filter order tracking. Masters thesis, Univ. Pretoria, 2008
118. Q. Wang, L. Xie, J. Liu, Z. Xiang, Enhancing particle swarm optimization based particle filter tracker, in Computational Intelligence, LNCS 4114/2006, pp. 1216–1221 (Springer Verlag, 2006)
119. G. Welch, G. Bishop, An introduction to the Kalman filter. Technical report, UNC-Chapel Hill, TR, 2006
120. N. Whiteley, A.M. Johansen, Recent developments in auxiliary particle filtering, ed. by B. Cemgil, Chiappa. Bayesian Time Series Models (Cambridge University Press, 2010)
121. D.H. Wilson, Assistive intelligent environments for automatic health monitoring. Ph.D. thesis, Carnegie-Mellon Univ., 2005
122. R.A. Wiltshire, G. Ledwich, P. O'Shea, A Kalman filtering approach to rapidly detecting modal changes in power systems. IEEE T. Power Syst. 22(4), 1698–1706 (2007)
123. Y. Wu, D. Hu, M. Wu, X. Hu, A numerical-integration perspective of Gaussian filters. IEEE T. Signal Process. 54(8), 2910–2921 (2006)
124. A. Yilmaz, Object tracking: a survey. ACM Comput. Surveys 38(4), 1–45 (2006)
125. J. Yu, Y. Tang, X. Chen, W. Liu, Choice mechanism of proposal distribution in particle filter, in Proceedings IEEE World Congress Intelligent Control and Automation, pp. 1051–1056 (2010)
126. A.Z. Zambom, R. Dias, A Review of Kernel Density Estimation with Applications to Econometrics (Universidade Estadual de Campinas, 2012). arXiv preprint arXiv:1212.2812
Chapter 2
Sparse Representations
2.1 Introduction
From the point of view of sparsity, l0 is the most appropriate norm, while l2 says almost nothing. On the other hand, from the point of view of mathematical analysis, finding the sparsest solution is an optimization problem, and in this context the l2 norm is well-known terrain while l0 is unusual.
As we shall see, our efforts will concentrate on using the l1 norm, which is linked to sparsity and is easier for mathematical work.
The main focus of this section is on methods for solving sparsity optimization problems. The basic literature is [3, 27, 76, 177].
Ax = b (2.2)
In general this minimization would not enforce sparsity. Some suitable terms should be added, as we shall see.
In the case of an underdetermined system, with m < n, there would in principle be infinitely many candidate solutions. A typical proposal is to choose the solution with the minimum l2 norm, so the problem is:

minimize ‖x‖2 , subject to: Ax = b

Again, in general this solution would not be sparse. The problem should be stated in other terms.
But why would it be interesting to obtain sparse solutions?
If you are trying to establish a model in a regression problem, a sparse solution will
reveal the pertinent variables as those with non-zero values. This helps to identify
and select the correct variables (in the specialized literature variable selection, or
feature selection, is a frequently cited issue).
When dealing with signals, it is sometimes desired to transform a set of measurements b into a sparse vector x. Later on, the vector x could be used to exactly reproduce b; in certain applications an approximate recovery would be sufficient. You may remember from past chapters that this was frequently the case with wavelets, where most information was contained in large entries of x, while other entries were negligible. Then, it would make sense to consider an approximation of the form:

‖A x − b‖ < ε   (2.4)

where some of the entries of x are set to zero. This is a sparse approximation.
Another case of interest is the design of digital filters with a minimum number of
non-zero coefficients, [17].
There is a method, called Basis Pursuit (BP), which replaces l0 with l1, so now the problem is:

minimize ‖x‖1 , subject to: Ax = b

Under certain conditions BP will find a sparse solution, or even the sparsest one.
Another method, called the Lasso, states the following problem:

minimize ‖Ax − b‖2² , subject to: ‖x‖1 ≤ c

This method can be used in the regression context, and for underdetermined systems to find sparse approximations. Like BP, under certain conditions it will find the sparsest solution.
Just in the phrases above, the term norm has been used with somewhat excessive license. A function ‖·‖ is called a norm if:
(a) ‖x‖ = 0 if and only if x = 0.
(b) ‖α x‖ = |α| ‖x‖ for all α and x.
(c) ‖x + y‖ ≤ ‖x‖ + ‖y‖ for all x and y (triangle inequality).
If only (b) and (c) hold, the function is called a semi-norm.
If only (a) and (b) hold, the function is called a quasi-norm.
The functions already taken as measures have the form:

‖x‖p = ( Σ_{i=1}^{n} |x_i|^p )^{1/p}   (2.5)
B_p(r) = { x | ‖x‖p ≤ r }   (2.6)

Then the solution that minimizes the lp norm is obtained at the intersection of the ball and the line L.
Figure 2.1 shows three cases, corresponding to the intersection of the line and the
balls B1/2 , B1 , and B2 . Notice that the intersection of the line L with the vertical axis
gives a sparsest solution (only one nonzero entry).
The left hand plot, (a), in Fig. 2.1 makes clear that quasi-norms with 0 < p < 1 promote sparsity. The central plot, (b), shows that l1 usually promotes sparsity (unless L is parallel to a border of B1). Finally, the right hand plot shows that l2 does not promote sparsity.
This simple example can be generalized to problems with more dimensions.
There are other functions that promote sparsity, like for instance 1 − exp(−|x|), or log(1 + |x|), or |x|/(1 + |x|) [76].
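A quick numerical check of this geometric picture (plain Python; the two test vectors are chosen here for illustration, they are not from the text): take a sparse vector and a dense vector with the same l2 norm and compare their lp values. The sparse vector gives the smaller value for p ≤ 1, which is exactly what the ball pictures suggest.

```python
def lp(x, p):
    # (sum |x_i|^p)^(1/p); for p < 1 this is only a quasi-norm
    return sum(abs(xi)**p for xi in x) ** (1.0 / p)

sparse = [1.0, 0.0]
dense  = [2**-0.5, 2**-0.5]   # same l2 norm as 'sparse'

for p in (0.5, 1.0, 2.0):
    print(p, lp(sparse, p), lp(dense, p))
```

For p = 2 the two values coincide (both vectors lie on the unit l2 ball), so minimizing l2 cannot distinguish the sparse candidate from the dense one.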
Fig. 2.1 The line L and, a the ball B1/2 , b the ball B1 , c the ball B2
There are several ways of solving the sparsity optimization problems. The BP problem can be tackled with linear programming. The Lasso problem can be treated with quadratic programming, but there is a better approach. In certain scenarios it would be more appropriate to use first order methods.
Let us enter into specific details.
2.2.3.1 Basis Pursuit
First, notice that ‖x‖1 = |x1| + |x2| + ⋯ + |xn|. One could use a set of auxiliary variables t1, t2, …, tn to obtain an equivalent problem:

minimize (t1 + t2 + ⋯ + tn),
subject to: |xi| ≤ ti , Ax = b

Next, each inequality can be expressed as two inequalities; for example, |x1| ≤ t1 is equivalent to: x1 ≤ t1, −x1 ≤ t1. Then, the problem can be expressed in the following compact form:

minimize (t1 + t2 + ⋯ + tn),
subject to:
I x − I t ≤ 0
−I x − I t ≤ 0
Ax = b   (2.7)
An alternative reformulation uses a split of x into positive and negative parts:

xi = pi − ni   (2.8)
with:

pi = xi if xi > 0, 0 else ; and ni = −xi if xi < 0, 0 else

Then: ‖x‖1 = Σi (pi + ni).
With these changes, the problem now would be:

minimize Σi (pi + ni) , subject to: A(p − n) = b ; p, n ≥ 0

which is LP.
The reader will find in [53] MATLAB programs for basis pursuit, using the linprog( ) function.
Of course, other methods, like for instance interior-point methods, can also be applied to solve this problem.
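As a sketch of the +/− split just described (Python with SciPy's `linprog` here instead of MATLAB's; the small matrix A and vector b are made up for illustration): minimize 1ᵀp + 1ᵀn subject to A(p − n) = b and p, n ≥ 0, then recover x = p − n.

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(A, b):
    # variables z = [p; n], both nonnegative; x = p - n
    m, nvar = A.shape
    cost = np.ones(2 * nvar)          # minimize sum(p) + sum(n) = ||x||_1
    A_eq = np.hstack([A, -A])         # A p - A n = b
    res = linprog(cost, A_eq=A_eq, b_eq=b, bounds=(0, None))
    p, n = res.x[:nvar], res.x[nvar:]
    return p - n

A = np.array([[1.0, 0.0, 0.5, 0.5],
              [0.0, 1.0, 0.5, -0.5]])
b = np.array([1.0, 0.0])
x = basis_pursuit(A, b)
print(np.round(x, 6))   # the minimum-l1 solution here is [1, 0, 0, 0]
```

For this particular A and b one can check by hand that any solution of Ax = b has ‖x‖1 ≥ 1, so the sparse solution [1, 0, 0, 0] is the unique BP optimum.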
2.2.3.2 Lasso
penalized form:

minimize ‖Ax − b‖2² + λ ‖x‖1   (2.9)

constraint form:

minimize ‖Ax − b‖2² , subject to: ‖x‖1 ≤ c
One of the good effects of ridge regularization is a certain shrinking of the solution components. According to [166], higher values of λ force these components to be more similar to each other.
In 1996, Tibshirani [176] introduced the Least Absolute Shrinkage and Selection Operator (LASSO) method, in which an l1 penalty is added, as in (2.9).
There are analyses of large data and data mining scenarios where it is important, as said before, to assess which variables are more important, more influential. The l1 penalty has the advantage of promoting sparsity. Contrary to the l2 penalty, the l1 penalty leads to solutions with many zero components, and this helps to determine the relevant variables. This is a main reason for the Lasso's popularity.
As we shall see in the next sections, the l1 penalty (and the l1 norm) plays an important role both in compressed sensing and image processing. In other contexts [156], l1 is used for robust statistics in order to obtain results insensitive to outliers [187], and for maximum likelihood estimates in the case of Laplacian residuals.
It is also shown, in an appendix of [187], that the two formulations are equivalent.
With respect to analytical solutions, recall that the least squares problem has the following solution:

x̂ = (AᵀA)⁻¹ Aᵀ b   (2.10)

while ridge regularization gives:

x̂ = (AᵀA + λ I)⁻¹ Aᵀ b   (2.11)
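The ridge solution (2.11) is easy to verify numerically. The NumPy sketch below (random problem data, purely illustrative) shows the shrinking effect mentioned above: increasing λ reduces the norm of the solution, but does not in general produce exact zeros, in contrast with the l1 penalty.

```python
import numpy as np

def ridge(A, b, lam):
    # x = (A^T A + lam I)^(-1) A^T b, as in Eq. (2.11)
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ b)

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)

norms = [np.linalg.norm(ridge(A, b, lam)) for lam in (0.0, 1.0, 10.0, 100.0)]
print(norms)   # decreasing as lambda grows
```

Via the SVD of A, each solution component behaves like σᵢ/(σᵢ² + λ), which explains the monotone shrinkage.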
In the case of Lasso, finding the solution is more involved, and many methods have been proposed. Indeed, it can be treated as a Quadratic Programming (QP) problem (see the appendix on optimization). In fact, Tibshirani proposed two ways of stating the Lasso as QP [166, 176], much in the line described above for expressing basis pursuit as LP. Concretely, one of the proposed ways uses a +/− split, and so the Lasso is expressed as follows:

minimize ‖A(p − n) − b‖2² , subject to: Σi (pi + ni) ≤ c ; p, n ≥ 0

which is a QP problem.
Nowadays there are better algorithms to solve Lasso problems. A popular one is LARS (Least Angle Regression), which was introduced in 2004 [75]. It is a stepwise iterative algorithm that starts with c = 0 (constraint form), and detects the most influential variables as c increases. Figure 2.2 shows an example of the paths of the solutions when c increases: for small values of c there is only one non-zero component of x; when c reaches a certain value a second non-zero component rises,
Fig. 2.2 Example of solution paths as c increases
then for larger c a third non-zero component rises, and so on. Notice that components with negative values can also appear. The next paragraphs introduce LARS in more detail.
Let us follow the scheme of [75] in order to explain LARS. A first point is that the LARS algorithm relates to the Forward Selection (FS) algorithm described in [185] (1980). The FS algorithm applies the following steps:
r_j = A_j x_j − b   (2.12)

Find the column A_k most correlated with r_j and perform a linear regression to obtain x_k; this gives a new residual:

r_k = A_k x_k − r_j   (2.13)
Repeat until a stop criterion is reached (for instance in terms of the statistical
characteristics of the residual).
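The FS loop can be sketched as follows (NumPy; a greedy variant with hypothetical function names, using the convention r = b − Ax for the residual, and re-fitting one column at a time as in the description above):

```python
import numpy as np

def forward_selection(A, b, nsteps):
    # Greedy FS: repeatedly pick the column most correlated with the
    # current residual, regress the residual on that column, and update.
    r = b.copy()
    x = np.zeros(A.shape[1])
    for _ in range(nsteps):
        corr = A.T @ r
        k = np.argmax(np.abs(corr))
        # least-squares coefficient of the residual on column k
        step = corr[k] / (A[:, k] @ A[:, k])
        x[k] += step
        r = r - step * A[:, k]
    return x, r

# orthonormal columns: FS recovers the exact coefficients in two steps
A = np.eye(4)[:, :3]
b = np.array([2.0, 0.0, -1.0, 0.0])
x, r = forward_selection(A, b, 2)
print(x)   # [2, 0, -1]
```

With correlated columns FS can overshoot, taking full regression steps; this "greediness" is precisely what LARS tempers by moving only as far as the equiangular geometry allows.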
J = { i, j, k, l, … } = { j_i }   (2.14)

A_J = { A_i, A_j, A_k, A_l, … }   (2.15)
In [75] an example was studied with 10 columns of data from 442 diabetes patients. Using forward stagewise, a solution path figure (like Fig. 2.2) was obtained after 6000 stagewise steps. The article remarks that the cautious nature of the algorithm, with many small steps, implies considerable computational effort. One of the benefits of LARS is that it requires far fewer steps.
Like the others, the LARS algorithm begins with x = 0, r = b, and a model μ0 = 0. It finds the A_j most correlated with r, and then it augments the model as follows:

μ1 = μ0 + γ1 A_j   (2.16)

This step will be the largest possible, until a third A_l is found such that A_j, A_k and A_l have the same correlation with r2 = b − μ2. The next step will proceed equiangularly between the three already selected A columns. And so on.
According to [75], the equiangular direction can be computed as follows:
Denote:

G_J = A_Jᵀ A_J ; Q = (1_Jᵀ G_J⁻¹ 1_J)^{−1/2}

Then:

u_J = A_J w_J , where w_J = Q G_J⁻¹ 1_J

so that A_Jᵀ u_J = Q 1_J , and ‖u_J‖2² = 1.
Denote:

a = Aᵀ u_J
ĉ = Aᵀ (b − μ_J) (current correlation)
Ĉ = max_i { |ĉ_i| }   (2.19)

Then:

γ = min+_{i ∉ J} { (Ĉ − ĉ_i)/(Q − a_i) , (Ĉ + ĉ_i)/(Q + a_i) }   (2.20)

(min+ means that the minimum is taken over only positive components).
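The equiangular direction formulas can be verified numerically. The NumPy sketch below (a random active set, purely illustrative) computes u_J and checks the two properties quoted above: every active column has the same correlation Q with u_J, and u_J has unit norm.

```python
import numpy as np

def equiangular(AJ):
    # G = AJ^T AJ; Q = (1^T G^-1 1)^(-1/2); w = Q G^-1 1; u = AJ w
    G = AJ.T @ AJ
    ones = np.ones(AJ.shape[1])
    Ginv1 = np.linalg.solve(G, ones)
    Q = 1.0 / np.sqrt(ones @ Ginv1)
    w = Q * Ginv1
    u = AJ @ w
    return u, Q

rng = np.random.default_rng(1)
AJ = rng.standard_normal((10, 3))
AJ /= np.linalg.norm(AJ, axis=0)   # LARS assumes normalized columns

u, Q = equiangular(AJ)
print(AJ.T @ u)           # every entry equals Q
print(np.linalg.norm(u))  # 1.0
```

The check is a one-line algebra exercise: AJᵀu = G w = Q·1, and ‖u‖² = wᵀGw = Q²·(1ᵀG⁻¹1) = 1.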
In order to use LARS for the Lasso problem it is necessary to add a slight modification: if an active component passes through zero, it must be excluded from the active set. It may happen that later on this variable enters again into the active set.
When this modification is included, the LARS algorithm is able to find all the solutions of the Lasso problem.
A simple, direct implementation of the LARS algorithm is offered in Program 2.1, which deals with a table of diabetes data. The reader can edit the lasso_f flag (see the code) in order to run LARS or LASSO. The program is based on the information and listings provided by [1], which has a link to the diabetes data (Stanford University). Figure 2.3 shows the solution paths obtained with LARS.
Figure 2.4 shows the solution paths obtained with LASSO, which have noticeable differences compared to the LARS solution paths.
Fig. 2.3 Solution paths obtained with LARS (diabetes data)
Fig. 2.4 Solution paths obtained with LASSO (diabetes data)
for k=1:length(J),
Aj(:,k)=signJ(k)*Aj(:,k);
end;
G=Aj'*Aj;
Ij=ones(length(J),1);
iG=inv(G);
Q=1/sqrt(Ij'*iG*Ij);
wj=Q*iG*Ij;
uj=Aj*wj;
a=A'*uj;
if nav==maxnav, gamma_av =maxc/Q; %unusual last stage
else
gamma=zeros(LcomplJ,2); %min of two terms
for k=1:LcomplJ,
n=complJ(k);
gamma(n,:)=[((maxc-c(n))/(Q-a(n))),...
((maxc+c(n))/(Q+a(n)))];
end;
%remove complex elements, reset to Inf
[pi,pj]=find(0~=imag(gamma));
for nn=1:length(pi), gamma(pi(nn),pj(nn))=Inf; end;
% find minimum
gamma(gamma<=0) = Inf;
mm=min(min(gamma)); gamma_av=mm;
[maxc_ix,aux]=find(gamma==mm);
end;
%update coeff estimate:
baux(J)=beta(J)+gamma_av*diag(signJ)*wj;
bkloop=0;
Jold=J;
% The LASSO option ------------
if lasso_f==1,
vls=1; %valid sign
dh=diag(signJ'*wj);
gamma_c = -beta(J)./dh;
%remove complex elements, reset to Inf
[pi,pj]=find(0~=imag(gamma_c));
for nn=1:length(pi), gamma_c(pi(nn),pj(nn))=Inf; end;
% find minimum
gamma_c(gamma_c <= 0) = Inf;
mm=min(min(gamma_c)); gamma_w=mm;
[gamma_w_ix,aux]=find(gamma_c==mm);
%Lasso modification:
if isnan(gamma_w), gamma_w=Inf; end;
if gamma_w < gamma_av,
gamma_av = gamma_w;
baux(J)=beta(J)+gamma_av*diag(signJ)*wj;
J(gamma_w_ix)=[]; %delete zero-crossing element
nav=nav-1;
vls=0;
end;
end;
% ------------------------------
norm1=norm(baux(Jold),1);
if oldnorm1<=t && norm1>=t,
baux(Jold)=beta(Jold)+Q*(t-oldnorm1)*diag(signJ)*wj;
bkloop=1; %for loop break
end;
oldnorm1=norm1;
mu=mu+ gamma_av*uj;
beta(Jold)=baux(Jold);
if bkloop==1, break; end %while end
end;
hbeta(ni,:)=beta';
ni=ni+1;
end;
% display --------------------------------------
[bm,bn]=size(hbeta);
aux=[0:100:nitmax];
q=0; %color switch
figure(1)
hold on; q=0;
for np=1:bn,
if q==0,
plot(aux,hbeta(:,np),'r',aux,hbeta(:,np),'k.');
axis([0 nitmax -1000 1000]);
if lasso_f==0,
title('LARS, diabetes set');
else
title('LASSO, diabetes set');
end;
end;
if q==1,
plot(aux,hbeta(:,np),'b',aux,hbeta(:,np),'m.');
axis([0 nitmax -1000 1000]); end
if q==2,
plot(aux,hbeta(:,np),'k',aux,hbeta(:,np),'b.');
axis([0 nitmax -1000 1000]); end
q=q+1; if q==3, q=0; end;
end;
grid;
x̂ = x / (1 + λ)   (2.22)

In this last expression the soft-thresholding operator S_{λ/2} is used; its values are:

S_{λ/2}(x) =
   x + (λ/2) ,  if x ≤ −λ/2
   0 ,          if |x| < λ/2
   x − (λ/2) ,  if x ≥ λ/2      (2.24)
The action of this operator is expressed in Fig. 2.5. This operator appears in other
optimization contexts.
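The operator of (2.24) is easy to implement. Here is a minimal Python/NumPy sketch (the chapter's programs are in MATLAB; this snippet is only an illustration, with a generic threshold t standing for λ/2):

```python
import numpy as np

def soft_threshold(x, t):
    """Soft-thresholding operator S_t: shrinks x toward zero by t,
    setting to zero all entries with magnitude below t."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

# The three cases of (2.24), with threshold t = lam/2:
lam = 1.0
print(soft_threshold(np.array([-2.0, 0.3, 2.0]), lam / 2))  # [-1.5  0.   1.5]
```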
In two highly cited papers, [201, 206], the oracle properties of Lasso were studied. These articles include relevant references to previous work. The desired oracle properties are that the right model is identified, and that the estimation rate is optimal. It may happen that, for a certain value of λ (the regularization parameter), a good estimation error is achieved but with an incorrect (inconsistent) model.
One of the sections of [206] shows that Lasso variable selection can be inconsistent. In agreement with this, [201] notes that, in general, if an irrelevant column A_k is highly correlated with columns A_i of the true model, Lasso may not be able to distinguish it from the true model columns. [201] introduces an "irrepresentable" condition, a constraint implying that the relevant variables are not too highly correlated with nuisance variables.
The irrepresentable condition of [201] is almost necessary and sufficient for Lasso to select the true model. The lesson for experiment designers is to avoid spurious variables that may have high correlation with relevant variables.
In order to get better oracle properties, the adaptive Lasso was introduced in
[206]. The original constrained Lasso is transformed to:
minimize ||A x − b||₂² + λ Σᵢ₌₁ⁿ |xᵢ| / |x̂ᵢ|^γ

(where γ > 0, and x̂ is an initial consistent estimate)
Notice that a weighting has been introduced in the penalty term. This weighting
is data-dependent. It is shown in [206] that consistency in variable selection can be
achieved.
By the way, the reader is referred to [206] for an interesting comparison of the
adaptive Lasso and the nonnegative garotte. The garotte finds a set of scaling factors
c j for the following problem:
minimize || b − Σⱼ₌₁ⁿ Aⱼ xⱼ cⱼ ||₂² + λ Σⱼ₌₁ⁿ cⱼ

(where cⱼ ≥ 0).
In some problems it is known that the variables belong to certain groups, and it
is opportune to shrink the members of a group together. The group Lasso has been
introduced in [195]. Suppose the n variables are divided into L groups. Denote Ak
the matrix obtained from A by selecting the columns corresponding to the k-th group.
Likewise, xk is the vector with components being the variables of the k-th group.
The group Lasso set the problem as follows:
minimize ||A x − b||₂² + λ Σₖ₌₁ᴸ √nₖ ||xₖ||₂

(where nₖ is the number of variables in the k-th group).
In another highly cited paper, [207] introduced the elastic net, which is a two-step method. For the first step, the paper introduces the "naïve elastic net" as the following problem:

minimize ||A x − b||₂² + λ₁ ||x||₁ + λ₂ ||x||₂²
Denote as x̃ the estimate obtained by the first step. The second step is just the following correction:

x̂ = x̃ / (1 + λ₂)   (2.26)
The elastic net can be regarded as a compromise between ridge regression and Lasso.
Because of the quadratic term, the elastic net penalty is strictly convex.
As it was shown in Fig. 2.2, the solutions of the Lasso problem follow polygonal
paths. This fact was observed in [141], in the year 2000. Based on this fact, the paper
proposed a homotopy algorithm to solve the problem. It uses an active set, which
is updated through the addition and removal of variables. The LARS algorithm, introduced four years later, is an approximation to the homotopy algorithm (see the brief historical account given in [71]). Later, in 2006, it was shown in [72] that the LARS/homotopy methods, first proposed for statistical model selection, could be useful for sparsity research.
With relative frequency new variants of the LARS and homotopy algorithms
appear in the literature, and different aspects of them are investigated, like for instance
[123] on the number of linear segments of the solution paths. See [107] for some
variants of group Lasso in the context of structured sparsity.
By splitting the variable into its positive and negative parts, x = x⁺ − x⁻, the problem is rewritten as:
minimize Q(x) = (1/2) || b − [A, −A] [x⁺; x⁻] ||₂² + λ 1ᵀ (x⁺ + x⁻)

with: x⁺ ≥ 0 ; x⁻ ≥ 0
Now, denote:

z = [x⁺; x⁻] ,  c = λ 1 + [−Aᵀ b; Aᵀ b] , and:

B = [  AᵀA   −AᵀA
      −AᵀA    AᵀA ]   (2.27)
Then:
Q(z) = (1/2) zᵀ B z + cᵀ z   (2.28)

and the gradient: ∇z Q(z) = B z + c
Therefore, a steepest descent algorithm can be stated:

z_{k+1} = z_k − α_k ∇z Q(z_k)   (2.29)
It was shown in [81] that the computation of the matrix B can be done more econom-
ically than its size suggests, and that the gradient of Q requires one multiplication
each by A and A T . No inversion of matrices is needed.
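Since B is built from the blocks ±AᵀA, the product B z never requires forming B: with z = [x⁺; x⁻] and x = x⁺ − x⁻, one product by A and one by Aᵀ suffice. A hedged Python sketch of this economy (the function name is ours, not from [81]):

```python
import numpy as np

def grad_Q(A, b, lam, z):
    """Gradient of Q(z) = 0.5 z^T B z + c^T z, for z = [x+; x-],
    using only one product by A and one by A^T (B is never formed)."""
    n = A.shape[1]
    x = z[:n] - z[n:]                 # x = x+ - x-
    g = A.T @ (A @ x)                 # A^T A x, via two matrix-vector products
    Atb = A.T @ b
    ones = lam * np.ones(n)
    # blockwise B z + c = [g - A^T b + lam*1 ; -g + A^T b + lam*1]
    return np.concatenate([g - Atb + ones, -g + Atb + ones])
```

The result can be checked against the explicit matrix B of (2.27) on a small example.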
There is an important family of methods in connection with the iterative shrink-
age/thresholding (IST) algorithm. Initially, IST was introduced as an EM algorithm
for image deconvolution [136]. According to [81], it can be used for Lasso problems and only requires matrix-vector multiplications involving A and Aᵀ. Let us
describe the idea; suppose you have the following composite problem:
x_{k+1} = arg min_x { (1/(2 αₖ)) ||x − uₖ||₂² + λ g(x) }   (2.32)
In the case of g(x) being ||x||₁, which is separable, there is a component-wise closed-form solution:

(x_{k+1})ᵢ = arg min { (1/(2 αₖ)) ((x)ᵢ − (uₖ)ᵢ)² + λ |(x)ᵢ| } = S_T ((uₖ)ᵢ)   (2.33)

where we used (x)ᵢ to denote the i-th component of x. The last, right-hand term includes a soft-thresholding operator S_T(y):

S_T(y) =
   y + T ,  if y ≤ −T
   0 ,      if |y| < T
   y − T ,  if y ≥ T      (2.34)

with: T = λ αₖ
Therefore, a simple iterative shrinkage/thresholding algorithm, named ISTA, has been devised [20]. There are two parameters to specify, λ and αₖ. The research literature has proposed a number of strategies for this specification.
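A minimal Python sketch of ISTA for this objective (a constant step parameter α is assumed; the strategies for choosing λ and αₖ are left open, as just mentioned):

```python
import numpy as np

def ista(A, b, lam, alpha, niter=200):
    """ISTA for min 0.5*||A x - b||^2 + lam*||x||_1.
    alpha is a constant step parameter; for convergence it should be
    at least the largest eigenvalue of A^T A."""
    x = np.zeros(A.shape[1])
    for _ in range(niter):
        u = x + (1.0 / alpha) * (A.T @ (b - A @ x))                # gradient step
        x = np.sign(u) * np.maximum(np.abs(u) - lam / alpha, 0.0)  # S_{lam/alpha}
    return x
```

As a sanity check, with A = I and α = 1 the iteration reduces to a single soft thresholding of b.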
Another, equivalent form of expressing the soft-thresholding operator is: S_T(y) = sign(y) · max(|y| − T, 0). If f(x) = (1/2) ||b − A x||₂², then:

∇f(x_k) = −Aᵀ (b − A x_k)
[Figures: original sparse signal, observed signal, and signal recovered using ISTA (amplitudes from 0 to 2, over 100 samples); and evolution of the objective function along 300 iterations.]
else
z(n)=p(n)+T;
end;
end;
end;
end;
figure(1)
subplot (3,1,1)
[i,j,v]=find(x);
stem(j,v,'kx'); hold on;
axis([0 100 -0.1 2]);
plot([0 100],[0 0],'k');
title('original signal');
subplot(3,1,2)
stem(y,'kx'); hold on;
axis([0 L -0.2 0.5]);
plot([0 L],[0 0],'k');
title('observed signal');
subplot(3,1,3)
stem(z,'kx');
axis([0 100 -0.1 2]);
title('recovered signal, using ISTA')
figure(2)
plot(J,'k')
title('Evolution of objective function')
Since the ISTA algorithm involves a gradient descent, it is a good idea to use Nesterov's accelerated gradient descent (see the section on gradient descent in the appendix on optimization). The result is the fast iterative shrinkage/thresholding algorithm (FISTA), which was introduced in 2009 [20].
While ISTA can be expressed as:

x_{k+1} = S_{λ/αₖ} ( x_k − (1/αₖ) ∇f(x_k) )   (2.38)

FISTA iterates as follows (L being a Lipschitz constant of ∇f):

x_{k+1} = S_{λ/L} ( y_k − (1/L) ∇f(y_k) )   (2.39)

t_{k+1} = ( 1 + √(1 + 4 tₖ²) ) / 2   (2.40)

y_{k+1} = x_{k+1} + ((tₖ − 1)/t_{k+1}) (x_{k+1} − x_k)   (2.41)
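A hedged Python sketch of the FISTA iterations (2.39)-(2.41); L is assumed known, e.g. the largest eigenvalue of AᵀA:

```python
import numpy as np

def fista(A, b, lam, L, niter=200):
    """FISTA (2.39)-(2.41) for min 0.5*||A x - b||^2 + lam*||x||_1.
    L: Lipschitz constant of the gradient (largest eigenvalue of A^T A)."""
    x = np.zeros(A.shape[1])
    y = x.copy()
    t = 1.0
    for _ in range(niter):
        u = y + (1.0 / L) * (A.T @ (b - A @ y))
        x_new = np.sign(u) * np.maximum(np.abs(u) - lam / L, 0.0)   # (2.39)
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0            # (2.40)
        y = x_new + ((t - 1.0) / t_new) * (x_new - x)               # (2.41)
        x, t = x_new, t_new
    return x
```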
There are direct generalizations of the ISTA and FISTA algorithms by replacing
the soft-thresholding operator with the proximal operator. This is frequent in the most
recent literature. In particular, ISTA is equivalent to the proximal
gradient method.
The convergence rate of subgradient methods is O(1/√k); proximal gradient methods obtain O(1/k), and accelerated proximal gradient methods further reduce it to O(1/k²).
Notice that an example of projected subgradient iterations for l1 minimization
was already included in the appendix on optimization. That example was clearly
related to a basis pursuit (BP) problem.
Another idea of Nesterov was to substitute non-smooth objective functions by
suitable smooth approximations [134]. One application of this idea for our problem,
is to consider a smooth approximation of the l1 norm minimization as follows.
smoothed min_x ||x||₁ = min_x max_{||u||∞ ≤ 1} { uᵀ x − (μ/2) ||u||₂² }   (2.43)
Based on this approach, a popular algorithm called NESTA was introduced in [21]. A
MATLAB implementation is available (see the section on resources). Other relevant
algorithms are TwIST [23], SALSA [5], and SpaRSA [186].
where f (x) and g(x) are closed proper convex functions. Both functions can be
non-smooth.
The ADMM method can be expressed as follows [145]:
Typically g(x) encodes the constraints and zk dom g(x) (the domain of g(x)).
On the other hand, xk dom f (x). Along the iterations, xk and zk converge to each
other. The ADMM method is helpful when the proximal mappings of f (x) and g(x)
are easy to compute but the proximal mapping of f (x) + g(x) is not so.
Equivalently, the method can be expressed in terms of an augmented Lagrangian,
starting from the following problem:
minimize f (x) + g(z) subject to x z = 0
It is shown in [145] that these three equations can be translated to the proximal
form introduced above.
The ADMM method has connections with classical methods, like the alternating
projections of von Neumann (see [9] for an interesting presentation). An extensive
treatment of ADMM, with application to Lasso type problems, can be found in [26].
Also, [94] includes accelerated versions, and application examples concerning the
elastic net, image restoration, and compressed sensing.
Sometimes, in problems of the type found in this chapter, it is convenient to consider the indicator function. Given a set C, its indicator function i_C(x) is 0 if x ∈ C, and ∞ otherwise. The corresponding proximal operator is Prox_f(x) = arg min_{u ∈ C} ||u − x||₂² = P_C(x), which is the projection onto the set C.
One of the applications of ADMM treated in [26] is the Basis Pursuit (BP) problem. This problem can be written as:

minimize f(x) + ||z||₁ , subject to x − z = 0

where f(x) is the indicator function of the set C = {x | A x = b}. In this setting, the ADMM iterations are the following:

x_{k+1} = P_C (z_k − u_k)

z_{k+1} = S_{(1/ρ)} (x_{k+1} + u_k)

u_{k+1} = u_k + x_{k+1} − z_{k+1}

where S_{(1/ρ)} is the soft-thresholding operator, and the projection P_C can be computed as follows:

P_C(x) = x + Aᵀ (A Aᵀ)⁻¹ (b − A x)
[Figures: the sparsest solution x (stem plot over 60 samples), and the evolution of ||x||₁ along the iterations.]
aux2=A'*uA*b;
for nn=1:niter,
% x update
x=(aux1*(z-u))+aux2;
% z update
zo=z;
xe=(alpha*x)+((1-alpha)*zo);
aux=xe+u; imu=1/mu;
z=max(0,aux-imu)-max(0,-aux-imu); %shrinkage
% u update
u=u+(xe-z);
% recording
rob(nn)=norm(x,1);
end
% display
figure(1)
stem(x,'k');
title('the sparsest solution x');
xlabel('n samples');
figure(2)
plot(rob,'k');
title('evolution of norm(x,1)');
xlabel('n iter');
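For reference, here is a hedged Python sketch of the same ADMM scheme as the listing above (rho and alpha mirror mu and alpha in the MATLAB program; the explicit matrix inversion assumes few rows and full row rank):

```python
import numpy as np

def admm_bp(A, b, rho=1.0, alpha=1.0, niter=300):
    """ADMM sketch for basis pursuit: min ||x||_1 s.t. A x = b.
    x-update: projection onto {x : A x = b}; z-update: soft thresholding."""
    m, n = A.shape
    AAt_inv = np.linalg.inv(A @ A.T)          # assumes m small, full row rank
    P = np.eye(n) - A.T @ AAt_inv @ A         # projector onto Nul(A)
    q = A.T @ AAt_inv @ b                     # so P w + q satisfies A x = b
    x = np.zeros(n); z = np.zeros(n); u = np.zeros(n)
    for _ in range(niter):
        x = P @ (z - u) + q                   # x-update (projection)
        xr = alpha * x + (1 - alpha) * z      # relaxation (alpha = 1: none)
        z = np.maximum(0, xr + u - 1/rho) - np.maximum(0, -(xr + u) - 1/rho)
        u = u + xr - z                        # dual update
    return x
```

On a tiny underdetermined system, the iterations converge to the minimum-l1 solution.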
x_k = Prox_g (y_k)   (2.55)

y_{k+1} = y_k + λ_k ( Prox_f (2 x_k − y_k) − x_k )   (2.56)

where λ_k ∈ [ε, 2 − ε].
2.2 Sparse Solutions 179
See the Thesis [74] for an extensive treatment of splitting methods. The history of
the Douglas-Rachford algorithm and other aspects of interest are concisely included
in [61]. There is a web page focusing on Douglas-Rachford (see the section on
resources).
An implementation example is provided in the next section for a compressive
sensing case.
There is a series of methods based on matching pursuit (MP). The basic idea is simple and familiar. Imagine you have a puzzle with a certain number of pieces. To assemble the puzzle, the usual procedure would be to select pieces that match certain empty regions of the puzzle. You select and place pieces one after another. This methodology can be applied to many types of problems; actually, some authors refer to it as a meta-scheme.
In the next sections we shall see the method being applied to dictionaries. If
you place yourself in the Fourier context, it would be natural to combine sinusoidal
harmonics. Or, perhaps, it would be better for your problem to combine arcs and line
segments to analyze image shapes, etc.
For now, let us focus on the problem considered in this section: to find a sparse
vector x such that:
Ax = b (2.57)
Most of the section has been centered on optimization. Matching pursuit is more related to variable selection. As in the LARS algorithm, one uses residuals and selects the most correlated columns of A. After setting the initial residual as r₀ = b and a counter k = 0, the basic MP would be as follows:
This basic algorithm suffers from slow convergence. The orthogonal matching
pursuit (OMP) has better properties. Due to the orthogonalization, once a column
is selected, it is never selected again in the next iterations. The OMP algorithm starts
as the MP algorithm, and then:
Once the suitable columns of A were selected, a matrix B can be built with these
columns. Now, there is a theorem that establishes that OMP will recover the optimum
representation if:
max_{Aᵢ} || B⁺ Aᵢ ||₁ < 1   (2.58)

(the maximum taken over the columns Aᵢ of A not included in B, with B⁺ the pseudo-inverse of B).
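A small Python sketch of OMP (the stopping rule and the least-squares refit are standard choices, not taken verbatim from a specific reference):

```python
import numpy as np

def omp(A, b, s, tol=1e-10):
    """Orthogonal matching pursuit sketch: greedily select the column
    most correlated with the residual, then refit by least squares."""
    r = b.astype(float).copy()
    support = []
    xs = np.zeros(0)
    for _ in range(s):
        if np.linalg.norm(r) < tol:
            break
        j = int(np.argmax(np.abs(A.T @ r)))
        if j not in support:                  # a column is never re-selected
            support.append(j)
        xs, *_ = np.linalg.lstsq(A[:, support], b, rcond=None)
        r = b - A[:, support] @ xs            # orthogonalized residual
    x = np.zeros(A.shape[1])
    x[support] = xs
    return x
```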
[Figure: evolution along the iteration number (values between 0 and 2.5, up to 80 iterations).]
The compressed sensing (CS) theory claims that it is possible to recover certain signals from fewer samples than the Nyquist rate required by Shannon theory. This is based on the observation that many natural signals, such as sound or images, can be well approximated with a sparse representation in some domain. For example, most of the energy of typical images is preserved with less than 4% of the wavelet coefficients [153, 175].
The term compressed sensing was coined by Donoho [68] in 2006. Other important contributions came, the same year, in [34, 41]. Since then, the theory has advanced and nowadays there are books on the mathematics of CS, like [62, 82]; however, there are still open questions, as we shall see.
Instead of acquiring large amounts of data with a sensor and then compressing them, the new paradigm proposes to directly obtain compressed data. The recovery of the signal would be done via an optimization process.
These ideas are supported by the scenario that has appeared several times in this book, having the form of a system of linear equations A x = b. In particular, our interest focuses on underdetermined systems.
One of the peculiar aspects of CS is that it is convenient to select a random matrix to be used as matrix A.
The idea of CS is based on the following fact: if you have a signal x (n samples) and
a m n matrix A, with m < n, you can write the following equation:
Ax = b (2.59)
[Fig. 2.12: diagram of b = A D z, with dimensions: b (m × 1), A (m × n), D (n × L), z (L × 1).]
If you have a signal x that can be expressed in a sparse format, for instance x = D z with z sparse, then you can use b = A x = A D z, adhering to the compressed sensing scheme. A typical example would be a signal that can be expressed with a few Fourier harmonics. In general, D would represent the use of a dictionary, and z would be the sparse vector of coefficients. The diagram shown in Fig. 2.12 illustrates the approach.
An analysis of these ideas under the view of information flow, would say that you
are exploiting the low-rate of information being transmitted by z; so, in fact, you
can concentrate the same information in b.
In relation with these points, it must be said that (strictly) sparse signals are rarely encountered in real applications. What is more probable is to find signals with a small number of significantly non-zero elements (the rest could also be non-zero). Such signals are considered compressible. One way of formalizing this concept is to look for a power-law decay of values; the signal components are rearranged in decreasing order of magnitude |x₍₁₎| ≥ |x₍₂₎| ≥ … ≥ |x₍ₙ₎| and then one checks that:

|x₍ₖ₎| ≤ R k^(−1/p)   (2.60)

In that case, the error of the best s-term approximation xₛ (keeping the s largest entries) obeys:

||x − xₛ||₁ ≤ C_p R s^(1−1/p)   (2.61)

||x − xₛ||₂ ≤ D_p R s^(1/2−1/p)   (2.62)
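The decay model (2.60) can be checked numerically. The following Python sketch builds a power-law signal with p = 0.5 and measures best s-term approximation errors, whose decay should follow (2.61)-(2.62):

```python
import numpy as np

# A compressible (not exactly sparse) signal: power-law decay with p = 0.5,
# so |x_(k)| = R * k^(-1/p) = k^(-2), as in (2.60)
n, p, R = 1000, 0.5, 1.0
x = R * np.arange(1, n + 1) ** (-1.0 / p)

def sterm_err(x, s, ord):
    """Error of the best s-term approximation (keep the s largest entries)."""
    xs = np.zeros_like(x)
    keep = np.argsort(np.abs(x))[::-1][:s]
    xs[keep] = x[keep]
    return np.linalg.norm(x - xs, ord)

# (2.61): l1 error ~ s^(1-1/p) = 1/s ; (2.62): l2 error ~ s^(1/2-1/p) = s^(-3/2)
for s in (10, 20, 40):
    print(s, sterm_err(x, s, 1), sterm_err(x, s, 2))
```

Doubling s roughly halves the l1 error and divides the l2 error by 2^{3/2}, matching the predicted exponents.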
In order to apply the CS approach, one has to guarantee that the compressed signal
can be recovered. This is the question that we want to consider now.
A first and important point is that, while in regression problems you inherit a given matrix A formed from experimental data, in the case of compressed sensing you are responsible for designing the matrix A. In consequence, it is convenient to have some guidance for building A.
In relation to this aspect (the design of A), an important step in the early development of CS theory was to establish the restricted isometry property (RIP). Let us introduce this property with some care.
Recall that a vector x is s-sparse if at most s of its entries are non-zero. A first, natural question would be: how many samples are necessary to acquire s-sparse signals? The answer is that m should satisfy m ≥ 2s (note that the dimensions of A are m × n). As explained in [133], the sampling matrix must not map two different sparse signals to the same set of measurement samples. Hence, each collection of 2s columns from A must be nonsingular.
In this line of thinking, [35] considered that the geometry of sparse signals should be preserved under the action of the sampling matrix. The s-th restricted isometry constant δₛ was defined as the smallest number such that:

(1 − δₛ) ||x||₂² ≤ ||A x||₂² ≤ (1 + δₛ) ||x||₂²   for all s-sparse x

When δₛ < 1, the expression above implies that each collection of s columns of A is non-singular, so (s/2)-sparse signals can be acquired. In case δₛ ≪ 1, the action of A would nearly maintain the l2 distance between each pair of sparse signals (the term isometry refers to keeping the distance).
It is said that A satisfies the restricted isometry property (RIP) if it has an associated isometry constant with value δₛ < 1.
Now, let us see if we will be able to recover x from b. One of the alternatives considered in CS is to state and solve a BP problem:

minimize ||x||₁ , subject to A x = b

According to [37], the BP solution x∗ recovers x exactly if the signal is sufficiently sparse and the matrix A has the RIP property. Moreover, assume that δ₂ₛ < √2 − 1; then (theorem):

||x∗ − x||₁ ≤ C₀ ||x − xₛ||₁   (2.64)

and:

||x∗ − x||₂ ≤ C₀ s^(−1/2) ||x − xₛ||₁   (2.65)
2.3 Compressed Sensing 185
where xs is the vector x with all but the s-largest entries set to zero. See [37] for the
constant C0 , which is rather small. In case that x was s-sparse, the recovery is exact.
Two consequences of the theorem are that:
Part of the research effort is trying to find larger values of δ₂ₛ still guaranteeing the recovery. For instance, recently (in 2013), a value of 0.5746 (instead of 0.4142) was established in [202]. This paper includes a table, with key references, of the previous results from other specialists.
Normally, one wants a matrix A with a small δₛ. It has been found that many types of random matrices have very good δₛ; often a value δₛ ≈ 0.1 can be obtained. On the contrary, it is difficult for deterministic matrices to have a good δₛ.
Random matrices can be obtained in several ways. Here is a short selection:
Gaussian matrices: the entries of A are independent normal variables, with zero mean and 1/m variance; that is: aᵢ,ⱼ ~ N(0, 1/m)
Bernoulli matrices: the entries of A are:

aᵢ,ⱼ = +1/√m with probability 1/2 ;  −1/√m with probability 1/2   (2.66)
Partial Fourier matrices: the entries of A are random samples of a Fourier matrix
See [42] and [133] for more alternatives, and details. A contemporary research
topic is to deterministically generate matrices with good RIP.
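A Python sketch that generates Gaussian and Bernoulli sensing matrices as defined above and checks, empirically, that ||A x||₂² stays close to 1 for random sparse unit vectors (this is only a sanity experiment; computing δₛ itself is combinatorial):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, s = 80, 256, 5

# Gaussian matrix: a_ij ~ N(0, 1/m)
A_gauss = rng.normal(0.0, 1.0 / np.sqrt(m), (m, n))
# Bernoulli matrix, as in (2.66): entries +-1/sqrt(m), probability 1/2 each
A_bern = rng.choice([-1.0, 1.0], size=(m, n)) / np.sqrt(m)

# Empirical near-isometry on random s-sparse unit vectors
ratios = []
for _ in range(200):
    v = np.zeros(n)
    supp = rng.choice(n, size=s, replace=False)
    v[supp] = rng.standard_normal(s)
    v /= np.linalg.norm(v)
    ratios.append(np.linalg.norm(A_gauss @ v) ** 2)
print(min(ratios), max(ratios))   # typically close to 1
```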
Supposing that the signal recovery would be done with BP, it is important to
consider also the null space property (NSP). Let us introduce this property.
Recall that the null space Nul(A) of the matrix A is the set of all solutions of A x = 0.
Consider the set of indexes L = {1, 2, …, n} and choose, for example, a subset S ⊂ L with |S| = s. Then, any vector x can be written as the sum x_S + x_S̄, where S̄ is the complement of S, the non-zero elements of x_S have indexes in S, and the non-zero elements of x_S̄ have indexes in S̄. For example, given x = {4, 3, 2, 1, 5, 6}, a possible decomposition could be: x_S = {0, 0, 2, 1, 5, 0} and x_S̄ = {4, 3, 0, 0, 0, 6}.
The matrix A satisfies the null space condition of order s if for any non-zero vector x ∈ Nul(A) and any index subset S with |S| = s, one has:

||x_S||₁ < ||x_S̄||₁   (2.67)
Intuitively, the NSP implies that non-zero vectors in the null space cannot be too
sparse. The problem we want to avoid is having a sparse vector with A x = 0, which
would clearly interfere with the recovery of other vectors.
When dealing with exactly sparse vectors, the spark characterizes when recovery is possible. However, when dealing with approximately sparse signals (for instance when there is noise), a more restrictive condition on the vectors in Nul(A) is needed, and this is why the NSP is introduced [64]. In relation with RIP, it can be shown that RIP implies NSP; that is, RIP is stronger than NSP.
A theorem establishes that BP will obtain exact recovery if and only if A satisfies the NSP [3, 157].
For the interested reader it could be opportune to consult [179] about the relation-
ship between the irrepresentable condition, already mentioned in the Lasso context,
and the RIP. The irrepresentable condition is more restrictive.
Nowadays, experience with some applications is showing that the RIP condition (which is a sufficient condition) is too stringent. The theory is advancing, and new, weaker conditions are being discovered or re-discovered [4].
As an example of sparse signal recovery, suppose that one has a sparse 1D signal,
then one uses a sensing matrix A to obtain samples (a vector b), and then applies the
Douglas-Rachford algorithm to recover the original signal. This example is borrowed
from a contribution of G. Peyre to the MATLAB Central file exchange site (Toolbox
of Sparse Optimization).
Program 2.5 handles this example and, at the same time, provides an example of
implementation of the Douglas-Rachford algorithm.
Figure 2.13 shows the original signal to be sampled.
The signal recovery can be considered as the following problem: minimize ||x||₁, subject to A x = b.
x_k = Prox_f (y_k) = y_k + Aᵀ (A Aᵀ)⁻¹ (b − A y_k)   (2.68)

(notice that f(x) and g(x) have been exchanged with respect to the description of the algorithm given in the previous section; both forms are equivalent).
Figure 2.14 shows the recovered signal.
The evolution of the algorithm can be followed in several ways. For example, Fig. 2.15 shows the evolution of ||x||₁ along the iterations.
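A hedged Python sketch of these Douglas-Rachford iterations for BP, using the projection (2.68) as Prox_f and soft thresholding as Prox_g (parameter values are illustrative):

```python
import numpy as np

def dr_bp(A, b, lam=1.0, gamma=1.0, niter=300):
    """Douglas-Rachford sketch for min ||x||_1 s.t. A x = b,
    with Prox_f the affine projection (2.68) and Prox_g soft thresholding."""
    pinv = A.T @ np.linalg.inv(A @ A.T)       # assumes full row rank
    prox_f = lambda y: y + pinv @ (b - A @ y)
    prox_g = lambda y: np.sign(y) * np.maximum(np.abs(y) - gamma, 0.0)
    y = np.zeros(A.shape[1])
    x = prox_g(y)
    for _ in range(niter):
        x = prox_g(y)
        y = y + lam * (prox_f(2 * x - y) - x)
    return prox_f(x)   # final projection ensures A x = b
```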
Some CS authors have proposed a kind of uncertainty principle, in the vein of the time-frequency duality [42, 76]. It is illustrative to have a quick look at this aspect.
Two important examples of orthogonal matrices are the identity matrix I and the
Fourier matrix F. Both matrices correspond to two orthobases, one allows for time-
domain representation of signals, and the other for frequency-domain representation.
More in general, given two orthobases Ψ and Φ, the signal b can be represented as follows:

b = Ψ α = Φ β   (2.70)

Suppose that the matrix A was the concatenation of the two orthogonal matrices Ψ and Φ. A particular example could be A = [I, F]; a sparse approximation of the signal b would be a superposition of spikes and sinusoids.
The mutual-coherence μ(A) is defined as follows [76]:

μ(A) = max_{i,j} | ψᵢᵀ φⱼ |   (2.71)

where ψᵢ and φⱼ are columns of Ψ and Φ (assumed normalized). It can be shown that (1/√n) ≤ μ(A) ≤ 1. For the case A = [I, F], the mutual-coherence is μ(A) = 1/√n.
If the two orthogonal matrices were not normalized, the mutual coherence would be expressed as follows:

μ(A) = max_{i,j} | ψᵢᵀ φⱼ | / ( ||ψᵢ||₂ ||φⱼ||₂ )   (2.72)
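The value μ(A) = 1/√n for A = [I, F] can be verified directly; a small Python check using the normalized DFT matrix:

```python
import numpy as np

n = 16
I = np.eye(n)
F = np.fft.fft(np.eye(n)) / np.sqrt(n)   # orthonormal Fourier (DFT) matrix

# mutual coherence between the columns of I and F
# (columns are unit-norm, so the denominators of (2.72) equal 1)
mu = max(abs(np.vdot(I[:, i], F[:, j])) for i in range(n) for j in range(n))
print(mu, 1 / np.sqrt(n))   # both equal 0.25 for n = 16
```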
The interesting fact is that, according to a theorem [76], one has:

||α||₀ + ||β||₀ ≥ 2 / μ(A)   (2.73)

Therefore, if the mutual-coherence of two bases is small, then α and β cannot both be very sparse. This can be regarded as an uncertainty principle. In particular, a signal cannot be sparsely represented both in time and frequency.
There is a theorem in [42] establishing the following: suppose the signal x is s-sparse, and take m samples of it. Then, if m ≥ C n μ²(A) s log n for some C > 0, BP will exactly recover the signal x from b = A x with very high probability.
In consequence, the smaller the coherence, the fewer samples are needed [42].
Then, in general, incoherent sampling would be recommended.
Usually, signals are not exactly sparse but compressible. Given a compressible signal x, it can be approximated by a sparse signal xₛ, being the vector x with all but the s-largest entries set to zero, with the following error:

ε = ||x − xₛ||₁   (2.75)

Suppose, in addition, that the measurements are contaminated with noise e:

b = A x + e   (2.76)
The problem of recovering the signal in these conditions, by minimizing ||x||₁ subject to a bound on the residual ||A x − b||₂, is called basis pursuit denoise (BPDN), and it was proposed in the famous paper [55]. Clearly, the problem looks like a Lasso problem. In fact, BPDN became so popular that the literature sometimes uses the term BPDN to refer to Lasso.
Assuming that δ₂ₛ < √2 − 1, one of the theorems presented in [37] establishes that the recovery obtained by the BPDN solution obeys:

||x∗ − x||₂ ≤ C₀ s^(−1/2) ε + C₁ ||e||₂   (2.77)
for constants C₀, C₁ > 0. A small C₀ would mean that the recovery is stable with respect to inexact sparsity; a small C₁ would mean that the recovery is robust with respect to noise. It is established in the theorem that both constants are rather small. For example, with δ₂ₛ = 1/4 the values of the constants would be C₀ ≤ 5.5 and C₁ ≤ 6 [42]. In the noiseless case, the solution could be found with BP, and will have the same value of C₀.
Some literature is considering alternatives to BPDN. One of these alternatives is
the Dantzig selector, which was proposed by [36] for cases where the number of
variables is much larger than the number of observations. The problem is formulated
as follows:
minimize ||x||₁ , subject to ||Aᵀ (b − A x)||∞ ≤ c
Images can be treated with the Fourier transform or wavelets with a number of pur-
poses, like compression, filtering, denoising, deblurring, etc. This has been already
treated in different chapters of this book.
However, processing innovations do not stop. Several new ways of decomposing an image into components have been proposed, opening exciting possibilities. Many publications have appeared, mainly in two directions: one focuses on separating edges from the rest, and the other on a kind of spatial dictionaries.
The objective of this section is to introduce with some detail these new approaches,
indicating a number of references that can be useful for the reader to dig into the
proposed methods.
f = u + v   (2.78)

where u is the "cartoon" component and v the texture component. There are authors that prefer other terms, like geometry or structure, for the u term. The v term could include constant color regions, textures that could be modeled with oscillatory functions, and others. Textiles, for instance, can usually be modeled with oscillatory functions.
Of course, it would always be possible to extract a u component from f, and then obtain v as f − u. However, it is more compliant with the purpose of sparse representation to try a model-based approximation of v.
Notice that image denoising considers that there is an original image p and a noise
w, so one has a noisy image q = p + w. The target of the denoising effort is to
extract p. Compared with the texture + cartoon decomposition, there are similarities
that could be more or less strong depending on the particular problem to be tackled.
This should be taken into account when looking at the literature.
It has been pointed out that a decomposition f = u + v seems to be analogous to
the high-pass and low-pass decompositions described in previous chapters. However,
this will not work as we want: both cartoon and texture contain high frequencies, so
a linear filtering cannot separate u and v. Other alternatives should be explored, like
for instance variational approaches.
In general, variational approaches try to obtain u as a function that minimizes a
certain criterion, which is expressed as an image model.
Looking at the recent history of variational decompositions, as summarized in
[29], an important proposal was done in 1989 by Mumford and Shah [130]. It is
worthwhile to consider in some detail this contribution.
It may well happen that in a certain photograph one wants to separate objects of
interest. For instance, cells (recall the example about thresholding in Chap. 11), roads,
faces, etc. This is the type of applications addressed by image segmentation. It leads
typically to the detection of edges, and so it has many things in common with f =
u + v decomposition.
Suppose a rectangular picture Ω with multiple objects O₁, O₂, …, Oₙ. There would be a set of regions Ω₁, Ω₂, …, Ωₙ in the image corresponding to these objects. Denote as Γ the set of smooth arcs that make up the boundaries of the regions Ωᵢ. In total one has: Γ = Γ₁ ∪ Γ₂ ∪ … ∪ Γₙ. Our main goal is to capture the boundaries, while the texture does not vary much inside each object. The image f could in this case be approximated by a piecewise-smooth function u.
Information is related to changes. With a certain analogy, one could speak of energies in the sense of information content. Using the Sobolev H¹ norm, the energy of a region Ωᵢ would be:
2.4 Image Processing 195
E(Ωᵢ) = ∫_{Ωᵢ} ||∇u(x)||² dx   (2.79)
(the function u must belong to the space H¹ of functions whose derivative is square-integrable).
It is known that functions belonging to H 1 cannot present discontinuities across
lines, such as boundaries. Therefore, u alone cannot model the boundaries.
It is opportune to consider the energy of the boundaries, as follows:
E( ) = Length ( ) (2.80)
In a most cited, seminal article [130], Mumford and Shah approximated the image
f by a piecewise-smooth function u and a set that solves the following problem:
arg min_{u,Γ} { μ Length(Γ) + λ ∫_{Ω∖Γ} |∇u(x)|² dx + ∫_Ω ( f(x) − u(x) )² dx }   (2.82)
Clearly, it is an energy minimization. It can be shown that removing any of the three
terms gives a trivial, not suitable solution.
The observed result of the minimization is usually a simplified version of the
original image, similar to a cartoon. It is not a perfect tool; for instance, shadows,
reflections, gross textures, may cause difficulties.
Mainly because of the term with Γ, it is not easy to solve the minimization problem. Many methods and approximations have been proposed, see [33, 66, 152] and references therein. An Octave implementation is included in [178]. There is a popular
related segmentation method based on the Chan-Vese model; this model is described
in [89] with an implementation in C available from the web.
An example of image segmentation using the Chan-Vese method will be given
below. In preparation for the example, let us include a brief summary of this method.
An important simplification is that u(x) can take only two values: c_in for x inside Γ, and c_out for x outside Γ. These constants can be regarded as average gray values:

c_in = ( ∫_in f dx ) / ( ∫_in dx ) ;  c_out = ( ∫_out f dx ) / ( ∫_out dx )   (2.83)
The first integral in the corresponding energy functional corresponds to the length of Γ. In some cases it would be convenient to add a term with the area enclosed by Γ, to control its size, but it has not been considered in our example.
A semi-implicit gradient descent could be applied for the minimization. This is done with an iterative evolution of:

Q = ( φ_xx φ_y² − 2 φ_x φ_y φ_xy + φ_yy φ_x² ) / ( φ_x² + φ_y² )^(3/2)

∂φ/∂t = δ(φ) [ Q − λ_in ( f − c_in )² + λ_out ( f − c_out )² ]   (2.85)

(where sub-indexes represent partial derivatives).
The δ(φ) can be approximated with:

δ_ε(φ) = ε / ( π ( ε² + φ² ) )   (2.86)
[Figures: original image (256 × 256); level set and Chan-Vese segmentation results.]
Q3=Phi([1,1:m-1],[2:n,n]); Q4=Phi([2:m,m],[1,1:n-1]);
Phi_xy = (Q1 + Q2 - Q3 - Q4)/4;
%TV term
Num = (Phi_xx.*Phi_y.^2) - (2*Phi_x.*Phi_y.*Phi_xy) + (Phi_yy.*Phi_x.^2);
Den = (Phi_x.^2 + Phi_y.^2).^(3/2) + a;
%Compute averages
c_in = sum([Phi>0].*I)/(a+sum([Phi>0]));
c_out = sum([Phi<0].*I)/(a+sum([Phi<0]));
%Update
aux=( Num./Den - lambda*(I-c_in).^2 + lambda*(I-c_out).^2);
Phi = Phi + dt*epl./(pi*(epl^2+Phi.^2)).*aux;
end;
% display of results
figure(1)
imagesc(I);
title('Original image');
colormap gray;
figure(2)
subplot(121);
imagesc(Phi);
title('Level Set');
subplot(122);
imagesc(I); hold on;
title('Chan-Vese Segmentation');
contour(Phi,[0,0],'m');
colormap gray;
Further steps in the direction suggested by the Mumford-Shah model were taken by
considering a total variation (TV) term. A brief introduction to TV was already given
in the previous book, in the context of image restoration. It is now opportune to
include a more extended treatment.
There are excellent publications, with a formal mathematical orientation, like
[47, 164], that use test functions φ belonging to the set C_0^1(Ω, ℝ²) of continu-
ously differentiable vector functions of compact support contained in Ω, and such
that ‖φ‖_∞ ≤ 1. These functions are employed for the definition of TV as follows:

TV(u) = \sup \left\{ \int_{\Omega} u \; \mathrm{div}\, \varphi \; dx \;:\; \varphi \in C_0^1(\Omega, \mathbb{R}^2), \; \|\varphi\|_{\infty} \le 1 \right\} \qquad (2.87)

Given a differentiable function u defined on a bounded open set Ω, its total varia-
tion is:

TV(u) = \int_{\Omega} | \nabla u(x) | \; dx \qquad (2.88)
Bounded variation functions can have sharp edges. Actually, the BV norm takes
into account the number of edges.
If a function u ∈ BV also belongs to the smaller Sobolev space W^{1,1}, then its
semi-norm is just TV(u). See [59] for an interesting study of BV functions, including wavelets.
The BV functions play an important role in the cartoon + texture decomposition.
In 1992 Rudin, Osher and Fatemi [163] proposed to apply TV for image denoising,
in a variational framework with the following expression:

\inf_{u \in BV} \left\{ \int_{\Omega} |\nabla u(x)| \, dx \;+\; \lambda \int_{\Omega} (f(x) - u(x))^2 \, dx \right\} \qquad (2.90)
Evidently, the ROF model (enclosed in braces) is composed of a TV term and a
fidelity term. This model has been cited in more than six thousand papers.
The TV term removes noise and small details while preserving edges. The authors
of the ROF model give in [163] an algorithm for computing the adequate value of λ
when the noise level is known. In other cases this value has to be chosen, considering
that it determines, in some sense, the smallest image feature to be kept.
From the point of view of optimization, the good news is that the ROF denoising
problem is convex, so the solution exists in BV and is unique [164]. A detailed study
of image recovery via TV minimization is [47]. The field of TV in imaging, including
algorithms, is reviewed by [45].
There are a number of observed problems when adhering to the ROF approach, as
described in [48]: loss of contrast, loss of geometry, staircasing, and loss of texture.
Part of the recent developments cited in [48] are oriented to solve these problems.
The ROF variational method can be adapted to different types of noise, as intro-
duced in [90]. This paper is full of practical numerical and analytical details, and
contains a link to an implementation in C code.
The next section includes an example of ROF-TV image denoising, accompanied by
a MATLAB program.
200 2 Sparse Representations
Meyer suggested to replace the l2 norm in the fidelity term with a weaker norm,
more adequate for modeling textures or oscillatory patterns. Hence, he proposed the
following minimization problem:

\inf_{u \in BV} \left\{ \int_{\Omega} |\nabla u(x)| \, dx \;+\; \lambda \, \| f(x) - u(x) \| \right\} \qquad (2.91)
Continuing with his approach, Meyer defined the space G, which is the Banach
space of all generalized functions v that can be written as v = div(g), where g =
(g₁, g₂) and g₁, g₂ ∈ L^∞(Ω). The space G is endowed with the G-norm, which is
defined as the infimum of the L^∞(Ω) norms of the functions |g|, with the infimum
being computed over all such decompositions of v.
The space G is the dual of a closed subspace of BV. When applying the
G-norm in the minimization problem, the second term is λ ‖f(x) − u(x)‖_G. Meyer
also defined two more spaces, E and F; see [128]. The spaces are related as follows:
BV ⊂ G ⊂ E ⊂ F.
G-functions may have large oscillations and nevertheless small norms, which is
suitable for the minimization to preserve textures.
Meyer did not propose any numerical procedure for the decomposition. A first
algorithmic contribution to this aim was made in [180] (the Vese-Osher model), which
soon was followed by other proposals, like [10, 193]. In [29] a simple conversion of
a linear filter pair into a nonlinear filter pair was proposed, obtaining a fast separation
of cartoon and texture.
It was suggested in [135] to replace the l2 term in the ROF model by an l1 term. This
article, from 2004, contains interesting references from the 90s about the fidelity
term and how to avoid outliers. According to this approach, the functional to
minimize is:

\inf_{u} \left\{ \int_{\Omega} |\nabla u(x)| \, dx \;+\; \lambda \int_{\Omega} | f(x) - u(x) | \, dx \right\} \qquad (2.92)
As shown in [135], the l1 norm is well suited to remove salt and pepper noise.
Further analysis by [49, 193] shows that the model enjoys interesting properties of
morphological invariance and texture extraction by scale, so geometrical features
are better preserved. A first, fast algorithm for solving the optimization problem was
presented in [12].
All three models, Meyer, Vese-Osher, and TV-l1, were compared by [194] using
a uniform computing approach. The three models were solved as second-order cone
programs (SOCP). The comparison covers 1D signals and 2D images. Also, [194]
contains a detailed history, with references, of alternatives for solving the three
models.
See [109] for a practical treatment of the TV-l1 decomposition, including various
examples and a link to software from a web site.
A generalization of the fidelity term was proposed in [12], replacing it by ‖f − u‖²_H,
where H is some Hilbert space. This generalization can include a number of different
models, and in particular the ROF model. One of the contributions of [12] is a Hilbert
space of Gabor wavelets.
There has long been interest in texture modeling and analysis for different
purposes. A brief review of invariant texture analysis methods is offered in [198].
With the advent of the cartoon-texture approach, more research has been devoted
to texture models favoring better image decompositions; see [125] for a modern
perspective involving a decomposition of the functional into three terms: the fidelity
term, a cartoon term, and a texture term.
As said before, it has been observed that the TV term induces image staircasing
effects. A remedy for this problem was introduced in [47], by including higher order
derivatives in the energy.
Several versions of the cartoon-texture decompositions were proposed in [50], by
combining the improved cartoon term of [47] and three alternatives for the texture
term (the third alternative considers texture + noise). This article is particularly
interesting in several ways: discussion, formulae, and experimental results.
See [110] for an interesting work on second order TV, and [121] for combining
TV and a fourth-order partial derivative filter.
Let us briefly collect a number of references that deal with important aspects of the
methods already introduced.
2.4.2 Patches
Science has surprising connections between seemingly disparate fields. For instance,
it happens that the term sparse coding also belongs, from the 90s, to Neuroscience
research.
As it will become clear soon, this biological aspect deserves some attention now.
A convenient guide is offered by the review done in [171], in special connection with
the work of Olshausen. Many references in [171] quote observations and conjectures
made by Barlow in the 60s.
Images are captured by the retina, transmitted to the LGN, and then to the area V1 of
the visual cortex; subsequent areas are V2, V4, MT, and MST. It has been reported that
nobody has been able to reconstruct the input image from the recordings of neurons in
V1. Cells in the visual cortex were classified as simple, complex, and hyper-complex
cells. Simple and complex cells are sensitive to specific stimuli orientations.
One of Barlow's hypotheses is that the role of early sensory neurons is to
remove statistical redundancy from the input. Hence, it is not strange that the first section
of [171] was devoted to PCA and ICA. This also corresponds to an important direction
of the research, which assumes that sensory neurons are specially adapted to the
statistical properties of those signals that occur more frequently. So, it is important to
investigate the relationship between natural signals and neural processing. In general,
natural images are not Gaussian and there are significant spatial correlations.
Another observation of Barlow was that neurons at later stages of processing are
generally less active than those at earlier stages. It seems that we may model what
happens in visual areas using multiple stages of efficient coding. This is, indeed, an
ambitious objective. Most initial steps of the research have focused on retina and the
first visual cortex areas.
Concerning the retina, it has been shown that the single-cell physiology and con-
trast sensitivity functions are consistent with the product of a whitening filter and a
Wiener filter for noise removal and adaptation to mean luminance level (see [171]
and references therein). Because of non-Gaussianity, a whitened natural image still has
lines, edges, contours, etc.
It seems that there is efficient coding at the retina level. The next question is if we
have this in V1.
In a famous letter to Nature in 1996, [138] proposed to represent an image as a
linear superposition of 2D basis functions, which are image patches extracted from
natural scenes. The representation has a conventional form:
I(x, y) = \sum_{i} a_i \, \phi_i(x, y) \qquad (2.94)
The set of basis functions emerged after training on many (on the order of 10^5) image
patches randomly extracted from natural scenes. The training tried to maximize the
sparsity of the representation, searching for components that are both sparse and
statistically independent (in the ICA sense). Actually, [138] shows an example of
192 basis functions, which are 16 × 16 pixel patches, extracted from 512 × 512
natural images. These functions resemble the spatial receptive field properties of
simple cells: they are spatially localized, oriented, and band-pass.
Figure 2.19 shows an example of a patch dictionary, with 256 patches of 8 × 8
pixels each.
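The patch-extraction step used to build such training sets is easy to sketch. The following Python fragment is illustrative only (it is not the code of [138]; the function name and the DC-removal step are our own choices, though removing the patch mean is a common preprocessing): it gathers random patches as vectorized columns of a training matrix.

```python
import numpy as np

def extract_patches(img, patch=8, n_patches=256, seed=0):
    """Randomly extract square patches and vectorize them as columns,
    the usual training set for a patch dictionary."""
    rng = np.random.default_rng(seed)
    H, W = img.shape
    cols = np.empty((patch * patch, n_patches))
    for k in range(n_patches):
        r = rng.integers(0, H - patch + 1)   # random top-left corner
        c = rng.integers(0, W - patch + 1)
        p = img[r:r + patch, c:c + patch].astype(float)
        p = p - p.mean()                     # remove the DC component
        cols[:, k] = p.ravel()
    return cols

# demo on a synthetic 64x64 gradient image
img = np.add.outer(np.arange(64), np.arange(64)).astype(float)
Y = extract_patches(img, patch=8, n_patches=100)
```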
In 1997, [139] made the hypothesis that V1 employs sparse coding with an over-
complete basis set.
The principle behind this kind of coding is that it tries to represent each image
in terms of a small set of functions, chosen from an overcomplete set. Only a few
neurons need to be active and expend energy. Actually, it has been estimated that at
any given moment only 1/50th of the cortical neurons could afford to be active, due
to energy constraints (see [140] and references therein).
It has been found that sparse coding is also employed by other senses and other
neural functions [140]. Part of current research considers space-time statistics,
using movies of natural environments as inputs [31, 171].
Back to our signal processing atmosphere, the lessons learned from neural processing
are summarized in [162] in the context of a history of transform design that includes
Fourier, wavelets, etc. This summary puts analytic dictionaries in contrast with trained
dictionaries. Analytic dictionaries are linked to harmonic analysis, use pre-defined
classes of functions, and are usually too simplistic for the complexity of nature.
Machine learning assumes, instead, that the structure of complex phenomena can be
more accurately extracted directly from the data.
Of course, if one adheres to the use of learned dictionaries, then a training method
should be devised. In his review, [161] identifies five available methods. Perhaps the
most popular of them is K-SVD, which will be described in more detail below. The
other methods are: generalized PCA, union of orthobases, the method of optimal
directions (MOD), and parametric training methods. References are provided for all
methods. The advantage of parametric training is that structured dictionaries are
obtained.
Given a set of training column data vectors y₁, y₂, …, y_N, the problem is to
find an (m × K) dictionary D, with K << N, such that every y_i is a sparse combination
of elements of D. Note that the data vectors y_i could contain 1D signals or vectorized
image patches.
Figure 2.20 shows a diagram corresponding to the problem. The representation
matrix X should be sparse (with sparsity s).
In mathematical terms, the problem to solve is:

\min_{D, X} \; \sum_{i} \| y_i - D \, x_i \|_2^2 \;, \quad \text{subject to} \quad \| x_i \|_0 \le s \qquad (2.95)

Fig. 2.20: the factorization Y (m × N) ≈ D (m × K) · X (K × N).
There is a general iterative scheme that can be used for solving the problem [44].
Each iteration includes two steps:
Sparse coding, for finding a minimizing X (for a fixed D)
Dictionary update, finding a minimizing D (for a fixed X )
The K-SVD algorithm is described in detail in [6], which is a highly cited article. It
can be considered as a generalization of the K-means clustering algorithm, already
seen in the chapter on data classification.
Like MOD and other methods, K-SVD iterates two steps: sparse coding and dic-
tionary update. Sparse coding can be done by any suitable method; K-SVD focuses
on the dictionary update. For this update, assume that both D and X are fixed and
select one column d_k of D; then, select the k-th row of X, which will be denoted as
x_T^k. Now, following [6], let us derive a convenient expression:

\| Y - D X \|_F^2 \;=\; \Big\| \, Y - \sum_{j} d_j \, x_T^j \, \Big\|_F^2 \;=\; \Big\| \Big( Y - \sum_{j \ne k} d_j \, x_T^j \Big) - d_k \, x_T^k \Big\|_F^2 \;=\; \big\| \, E_k - d_k \, x_T^k \, \big\|_F^2 \qquad (2.96)
where E k is an error matrix.
Denote as ω_k the group of indices pointing to vectors y_i that use d_k (those where
x_T^k(i) is nonzero). The group has a certain number L of indices.
Also, denote as Ω_k an (N × L) matrix with ones on the (ω_k(i), i) entries, and zeros else-
where. This matrix will be used for shrinking purposes. For instance, the result of
x_R^k = x_T^k Ω_k is a row vector x_R^k of length L, which is obtained from x_T^k by discarding
the zero entries. Similarly, E_k^R = E_k Ω_k is the set of error columns corresponding to
examples that use d_k.
Then, for the dictionary update one has to minimize:

\| \, E_k \, \Omega_k - d_k \, x_T^k \, \Omega_k \, \|_F^2 \;=\; \| \, E_k^R - d_k \, x_R^k \, \|_F^2 \qquad (2.97)
Here, one can use the SVD (singular value decomposition) for getting the desired solu-
tion. The matrix E_k^R is decomposed as E_k^R = U Δ V^T. The solution for d_k is the
first column of U, and for the vector x_R^k, the first column of V multiplied by the
singular value Δ(1, 1).
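The rank-1 update just described can be sketched in a few lines. The following Python fragment is an illustrative sketch of Eqs. (2.96)–(2.97) (function and variable names are our own; the sparse-coding step is not included, only the update of one atom):

```python
import numpy as np

def ksvd_update_atom(Y, D, X, k):
    """One K-SVD dictionary-update step for atom k: restrict to the
    signals that use d_k, form the restricted error matrix E_k^R, and
    replace d_k and the nonzero part of row x_T^k by the best rank-1 fit."""
    omega = np.flatnonzero(X[k, :])        # indices of signals that use atom k
    if omega.size == 0:
        return D, X                        # atom unused: nothing to update
    # error without the contribution of atom k, restricted to omega
    E_R = Y[:, omega] - D @ X[:, omega] + np.outer(D[:, k], X[k, omega])
    U, s, Vt = np.linalg.svd(E_R, full_matrices=False)
    D[:, k] = U[:, 0]                      # new atom: first left singular vector
    X[k, omega] = s[0] * Vt[0, :]          # new coefficients: sigma_1 * first right s.v.
    return D, X

# tiny demo with random data: the update cannot increase the fit error
rng = np.random.default_rng(1)
Y = rng.standard_normal((16, 40))
D = rng.standard_normal((16, 8)); D /= np.linalg.norm(D, axis=0)
X = rng.standard_normal((8, 40)) * (rng.random((8, 40)) < 0.3)   # sparse codes
err0 = np.linalg.norm(Y - D @ X)
D, X = ksvd_update_atom(Y, D, X, k=0)
err1 = np.linalg.norm(Y - D @ X)
```

Since the SVD gives the best rank-1 approximation of E_k^R, the restricted error, and hence the total error, cannot increase; the zero pattern of X is preserved because only the entries in ω_k are touched.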
The methods already described, like MOD or K-SVD, learn a dictionary D to sparsely
represent the patches of an image, rather than the whole image itself [189]. In an
important article [78], a proposal was made in the context of image denoising.
Fig. 2.21 Original picture, and image with added Gaussian noise
Overlapping patches were used. The idea was to denoise each patch via sparse cod-
ing, and then estimate the total image as the average of the patches together with the
observed noisy image. Actually, [78] adopted a Bayesian perspective, by defining a
global image prior that forces sparsity over all patches. Further elaboration in this
line was presented in [77, 79].
In the case of denoising and other applications (inpainting, deblurring, etc.) it is
natural to consider a Bayesian treatment. If I see a noisy or corrupted image, I would
say: this image is not clean; I expected something cleaner. In consequence, I have
a kind of model (a prior) of what an image should be.
See [189] for a fast method for whole image recovery using patch-dictionary. It
has an associated web page with MATLAB code.
Consider the example depicted in Fig. 2.24, which is a synthetic image that may be
similar to an astronomical picture (stars and filaments).
The stars could be well represented using wavelets, while for the lines ridgelets
are more suitable. Then, it seems appropriate to use two dictionaries, wavelets and
ridgelets, for representing the image. A decomposition of the image into two com-
ponents would be possible, one component with the stars and the other with the
lines.
Although simplistic, this example illustrates well the idea of morphological com-
ponent analysis (MCA), [172, 173]. As summarized in [24], the MCA method relies
on an iterative thresholding algorithm, with a threshold that decreases linearly along
iterations. Let us describe the algorithm, based on [24].
The case considered is a signal consisting of a sum of K signals y_i, having different
morphologies. A dictionary of bases {Φ₁, Φ₂, …, Φ_K} is assumed to exist. Signal
y₁ is sparse in Φ₁; signal y₂ is sparse in Φ₂; and so on. Denote α_i = Φ_i^T y_i.
Suppose, for simplicity, that one has only two signals (K = 2), so y = y₁ + y₂
(the results can be easily generalized to more components). It is proposed, in order
to estimate the components of y, to solve the following minimization:

\min_{y_1, y_2} \; \| \Phi_1^T y_1 \|_1 + \| \Phi_2^T y_2 \|_1 \;, \quad \text{subject to} \quad \| y - y_1 - y_2 \|_2 \le \sigma

where σ is the noise standard deviation. Continuing with a simplified view, assume
for now that σ = 0.
The first step of the MCA algorithm sets the number of iterations I_max, the min-
imum threshold δ_min, initial estimated values of y₁ and y₂, and the initial thresholds
δ₁^(1) and δ₂^(1).
Then, iterations begin (k = 1, 2, …):
While the two thresholds are higher than the lower bound δ_min, do, for j = 1, 2:
Compute residuals, using the current estimates y_i^(k−1) of the components:

r_j^{(k)} = y - \sum_{i \ne j} y_i^{(k-1)} \qquad (2.98)

Compute the coefficients of the residual in Φ_j and threshold them with δ_j^(k):

\alpha_j^{(k)} = \mathrm{Thresh}\left( \Phi_j^T \, r_j^{(k)} , \; \delta_j^{(k)} \right) \qquad (2.99)

Reconstruct the component:

y_j^{(k)} = \Phi_j \, \alpha_j^{(k)} \qquad (2.100)

Decrease the thresholds linearly:

\delta_j^{(k+1)} = \delta_j^{(1)} - k \, \frac{\delta_j^{(1)} - \delta_{min}}{I_{max} - 1} \qquad (2.101)
If there is no noise, δ_min should be set to zero. On the other hand, when there is
noise, δ_min should be set to a few times σ (see [24]).
The MCA algorithm provides a good component separation when the Φ_i, the
members of the dictionary, are mutually incoherent enough. In the examples provided
by [173], textures are treated with the DCT (discrete cosine transform), since the DCT is
suitable for the representation of natural periodicity (in case of non-homogeneous
textures, a local DCT could be used). As said before, lines are well represented with
ridgelets. In addition, the curvelet transform represents edges in images well.
One of the contributions of [24] with respect to the original MCA is a method
called mean-of-max (MOM) for the decrease of the thresholds. Denote:

m_j = \| \, \Phi_j^T \, r^{(k)} \, \|_{\infty}

then the thresholds are set as:

\delta_j^{(k)} = \frac{1}{2} \, ( m_1 + m_2 ) \qquad (2.102)
It is also shown in [24] that MCA/MOM is clearly faster than BP (basis pursuit),
being at least as efficient as BP in achieving sparse decompositions in redundant
dictionaries.
See the book [174] for an extensive treatment of sparse representation and process-
ing, including chapters on wavelets, ridgelets, curvelets, etc. In particular, chapter
18 of that book focuses on MCA, using the MCALab package (in MATLAB) by the
same authors. Several interesting application examples are presented.
A simple 1D example of MCA processing is the case of a signal composed of two
sine signals with close frequencies, plus three spikes at random positions. The target
of MCA is to separate the morphological components of the signal. We chose this example
and prepared a simplified version of MCALab for this case. Only two dictionaries
were considered, one based on the discrete cosine transform (DCT), and the other on
Dirac pulses. The simplified version is the Program 2.7. In this program, the initial
value of δ was estimated using the finest details obtained by a Daubechies wavelet.
During the execution of the program, the selection process is visualized with an
animated figure. Figure 2.25 shows an example. When the program stops, another
figure is shown with the original signal and its components (Fig. 2.26). Then, the user
can see how well MCA worked, by comparing the original and the separation
result.
%analysis:
x=zeros(4*nv,1); L=2*N;
x(2:2:L)=Ra(:); z=fft(x);
Ca= [struct('coeff',[]) struct('coeff',real(z(1:nv))./Ko)];
%thresholding (not the low frequency components):
cf = Ca; ay=cf(2).coeff;
cf(2).coeff=ay.*(abs(ay)>delta);
Ca = cf;
%synthesis:
c=Ca(2).coeff; lc=length(c); M=lc/qq;
fu=(1:(lc-1))';
Ku=[M; 0.5*(M+sin(2*pi*fu/qq)./(2*sin(pi*fu/lc)))];
Ku=Ku.^0.5;
c=c./Ku;
x=zeros(4*lc,1); L=2*M;
x(1:lc)=c; z=fft(x);
y=real(z(2:2:L));
part(:,1)=y(:)/qq; %output
% Dirac part--------------
Ra=part(:,2)+residual;
%analysis:
Ca= [struct('coeff',[]) struct('coeff', Ra(:))];
%thresholding (not the low frequency components):
cf = Ca; ay=cf(2).coeff;
cf(2).coeff=ay.*(abs(ay)>delta);
Ca = cf;
%synthesis:
part(:,2)=Ca(2).coeff(:); %output
% Update parameters---------------
delta=delta-lambda; %linear decrease
% Display along the process
nfg=nfg+1;
if nfg==4,
nfg=0; %restart counter
figure(1)
subplot(3,1,1)
plot(sum(part(1:N,:),2));axis tight;drawnow;
title('sum of detected parts')
axis([0 N -maxp maxp]);
subplot(3,1,2)
plot(part(1:N,1));axis tight;drawnow;
title('the sum of 2 sines')
axis([0 N -maxp maxp]);
subplot(3,1,3)
plot(part(1:N,2));axis tight;drawnow;
title('the 3 spikes at random')
axis([0 N 0 maxp]);
end
end
part = part(1:N,:);
% Final display-----------------------------------
%Original signals
figure(2)
subplot(3,1,1)
plot(signal);axis tight;
title('the original composite signal')
axis([0 N -maxp maxp]);
subplot(3,1,2)
plot(ys);axis tight;
title('the original sum of 2 sines')
axis([0 N -maxp maxp]);
subplot(3,1,3)
plot(yk);axis tight;
title('the original 3 spikes at random')
axis([0 N 0 maxp]);
Some of the topics to be treated in this chapter require new concepts and tools;
for instance, the functions provided by MATLAB for sparse calculus, and concepts and
solution techniques related to certain optimization or signal processing problems.
In particular, it has been found that the Bregman algorithms are very suitable for
optimal compressed sensing and other applications.
There are two matrix storage modes in MATLAB. Full storage is the default and
stores the value of each element. Sparse storage, which must be explicitly invoked,
stores only the values of the nonzero elements.
The function sparse( ) can be used to create a sparse matrix. For example, if you
write:

S = sparse([], [], [], 1000, 1000);

an empty 1000 × 1000 sparse matrix S is created. Then, you can specify some nonzero
elements, for instance: S(2, 3) = 5;
The function nnz(B) returns the number of nonzero elements of the matrix B,
which can be in full or in sparse format. The function find(B) returns all (i, j) indices
of nonzero elements. The function nonzeros(B) returns all the nonzero elements.
As with full matrices, with sparse matrices you can also use the expression:

x = A\b

In many applications, sparse matrices are more convenient than full matrices.
Consider, for instance, a 10,000 × 10,000 matrix with only 20,000 nonzero elements:
solving A x = b will be much faster with A stored as a sparse matrix than as a
full matrix.
Many more details of the use of sparse matrices in MATLAB can be found in [91].
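The storage idea itself can be illustrated outside MATLAB. The following Python toy class is a didactic sketch only (not a library API): it mimics sparse( ), nnz( ), find( ) and nonzeros( ) with a dictionary of keys, so that only nonzero entries occupy memory.

```python
class DOKSparse:
    """Toy dictionary-of-keys sparse matrix, mimicking MATLAB's sparse
    storage: only nonzero entries are kept."""
    def __init__(self, m, n):
        self.shape = (m, n)
        self.data = {}                     # (i, j) -> value

    def __setitem__(self, ij, v):
        if v != 0:
            self.data[ij] = v
        else:
            self.data.pop(ij, None)        # storing a zero removes the entry

    def __getitem__(self, ij):
        return self.data.get(ij, 0)        # absent entries read as zero

    def nnz(self):                         # like MATLAB's nnz(B)
        return len(self.data)

    def find(self):                        # like find(B): indices of nonzeros
        return sorted(self.data.keys())

    def nonzeros(self):                    # like nonzeros(B): the values
        return [self.data[ij] for ij in self.find()]

# an "empty" 1000 x 1000 sparse matrix, then a few nonzero elements
S = DOKSparse(1000, 1000)
S[0, 0] = 5.0
S[2, 3] = -1.0
S[999, 999] = 7.0
```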
As a first, simple example, the Program 2.8 creates a tri-banded sparse matrix,
which you can list using full(A) without a semicolon; it then applies spy( ) to display
(Fig. 2.27) the matrix structure (nz = 13 non-zero entries).
2.5 An Additional Repertory of Applicable Concepts and Tools 217
Since it is a popular example, MATLAB includes data for the Bucky ball, which is
composed of 60 points distributed as in a soccer ball. Each point has three neighbours.
The Bucky ball models the geodesic dome made popular by Buckminster Fuller, and
also the C60 molecule (a carbon molecule with 60 atoms). The Bucky ball adjacency
matrix is a 60 60 symmetric matrix.
Figure 2.28 shows the spy( ) visualization of the Bucky ball matrix.
From the data given by MATLAB it is also possible to get node coordinates, and to
plot the Bucky ball with gplot( ). Both Figs. 2.28 and 2.29 were generated with the
Program 2.9.
Among the sources of data on the Internet, there are two important collections of
sparse matrices, one in the Matrix Market and the other in the University of Florida
Sparse Matrix Collection. From this second source, we chose the matrix HB/nnc1374
as an example. It is a 1374 × 1374 matrix, with only 8588 non-zero entries.
Notice that earlier versions of MATLAB cannot handle such a large matrix.
Figure 2.30 displays the spy( ) diagram of this matrix. The figure has been gen-
erated with the Program 2.10, which shows how to extract the matrix from the data
structure downloaded from the Internet repository.
2.5.2 Diffusion in 2D
Let us begin with a kind of strange example. Suppose you want to paint your inclined
roof. You could climb to the top edge and just drop the paint, letting the colour flow
down. After some evolution time, you could expect some uniformity. Of course, this
is not a recommended method, but it gives you the flavour of what will be introduced
next for denoising, inpainting and other applications. Notice that the colour flow
will be mostly unidirectional, as governed by the slope.
Another example would be the following: you take a thin aluminium plate, put
a flame below and near the centre during some time, and observe how the heat
flows from the centre toward the borders of the plate. The heat diffusion will be
omnidirectional.
As will be seen next, the heat diffusion can be related with 2D Gaussian filtering,
considering the image intensity as analogous to energy. The image filtering would
take some time (some iterations), to let the diffusion evolve.
In the case of denoising, one wants to eliminate the noise using diffusion, but, at
the same time, one wants to preserve edges. Is this possible? This also will be treated
next.
Since diffusion directions would become important, the proper mathematical tool
should be partial differential equations (PDE). Let us advance some important basic
equations concerning diffusion and heat.
Denote φ(x, y, t) the density of the diffusing substance at a given position and
time. The diffusion equation is:

\frac{\partial \phi}{\partial t} = \nabla \cdot \left[ \, D(\phi, x, y, t) \, \nabla \phi \, \right] \qquad (2.103)

where D() is the diffusion coefficient (or diffusivity). This coefficient could be a
scalar, a scalar function of the coordinates (non-homogeneous diffusion), or a tensor
(which could correspond to anisotropic diffusion). In 2D this tensor is a symmetric
positive definite matrix. The equation becomes non-linear if the coefficient depends
on φ.
When the diffusivity is a constant scalar D, the equation reduces to the heat equation:

\frac{\partial \phi}{\partial t} = D \, \nabla^2 \phi \qquad (2.104)
In order to discretize the equation for computing the heat diffusion on a grid, one
could approximate the second derivatives as follows:

\frac{\partial^2 \phi}{\partial x^2} \approx \frac{\phi(x+dx,\, y) - 2\,\phi(x, y) + \phi(x-dx,\, y)}{(dx)^2} \qquad (2.105)

\frac{\partial^2 \phi}{\partial y^2} \approx \frac{\phi(x,\, y+dy) - 2\,\phi(x, y) + \phi(x,\, y-dy)}{(dy)^2} \qquad (2.106)

These expressions are to be introduced in: \nabla^2 \phi = \partial^2 \phi / \partial x^2 + \partial^2 \phi / \partial y^2.
For the first derivative with respect to time, one could use a first-order approximation:

\frac{\partial \phi}{\partial t} \approx \frac{\phi(t + dt) - \phi(t)}{dt} \qquad (2.107)
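The resulting explicit time-stepping scheme can be sketched as follows. This Python fragment is an illustrative translation of the scheme, not Program 2.11 itself; the grid size, time step, and the fixed (Dirichlet) borders are our own choices.

```python
import numpy as np

def heat_step(u, D=1.0, dx=1.0, dy=1.0, dt=0.2):
    """One explicit time step of the 2D heat equation, using central
    second differences in space and a forward difference in time.
    Borders are held fixed for simplicity."""
    lap = np.zeros_like(u)
    lap[1:-1, 1:-1] = ((u[1:-1, 2:] - 2*u[1:-1, 1:-1] + u[1:-1, :-2]) / dx**2 +
                       (u[2:, 1:-1] - 2*u[1:-1, 1:-1] + u[:-2, 1:-1]) / dy**2)
    return u + dt * D * lap

# heat a central region of a plate and let it diffuse
u = np.zeros((21, 21))
u[8:13, 8:13] = 100.0
for _ in range(50):
    u = heat_step(u)
```

After the iterations, the hot square has spread out: the peak temperature has dropped below its initial value while the field stays non-negative.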
A simple example has been devised, in which a central region of a plate is heated,
and then the heat diffusion takes place along time. The computation of the diffusion
on a grid has been done with the Program 2.11. For a simpler notation, the heat has
been denoted as u. Figure 2.31 shows a sequence of plots, from left to right and top
to bottom, corresponding to the process evolution. The program execution may take
several minutes.
Note that the time step must satisfy:

\frac{2 \, D \, dt}{\min\left( (dx)^2, \, (dy)^2 \right)} \le 1 \qquad (2.108)

otherwise, the numerical scheme becomes unstable (as you may want to check).
In the case of heat diffusion with homogeneous Neumann boundary conditions and
D = 1, one has the following problem:

\frac{\partial \phi}{\partial t} = \nabla^2 \phi \qquad (2.109)

\phi(x, y, 0) = \phi_0 \qquad (2.110)

\frac{\partial \phi}{\partial n} = 0 \;, \quad (x, y) \in b(\Omega) \qquad (2.111)

where b(Ω) is the boundary of the region of interest Ω (usually a rectangle in the
case of a picture), and n is the normal to this boundary.
It has been established that the solution of this problem is given by the convolution
of φ₀ and the Gaussian function with σ = √(2t). In the Fourier domain, the solution is:

\Phi(\omega) = \exp\left( - \frac{|\omega|^2}{2 / \sigma^2} \right) \, \Phi_0(\omega) \qquad (2.112)
Therefore, the Gaussian diffusion is equivalent to a special low-pass filter. In the
case of a picture, the effect would be image blurring. This blurring does not respect
edges, so the structure is lost.
Figure 2.32 shows the effect of Gaussian diffusion on a picture. On top, the original
picture; in the middle, the image after 10 s of diffusion; in the bottom, the image
after another 10 s of diffusion. The processing has been done with the Program 2.12,
which is similar to the previous program for heat diffusion.
Although it may then easily become unstable, the diffusion process could be
reversed in order to sharpen (or deblur) the image. Figure 2.33 shows an example,
with the original picture on top and the sharpened image below. Notice in Program
2.13 that the diffusion constant and the diffusion time have been decreased.
In 1987 Perona and Malik introduced a celebrated model [149], see also [150], that
has been cited by more than eight thousand papers. The target was to protect, and
even improve, the edges while smoothing more homogeneous regions of the picture.
The diffusion coefficient D() will depend on the local image gradient. For small
gradients, corresponding to homogeneous regions, large values of D() are allowed,
promoting stronger smoothing. On the other hand, for large gradients, corresponding
to edges, smaller values of D() are used to slow down the diffusion or even force the
diffusion to go backwards.
Using now the notation commonly employed for images (u instead of φ), the
Perona-Malik formulation of the diffusion becomes:

\frac{\partial u}{\partial t} = \nabla \cdot \left[ \, D(|\nabla u|) \, \nabla u \, \right] \qquad (2.113)

u(x, y, 0) = u_0 \qquad (2.114)

\frac{\partial u}{\partial n} = 0 \;, \quad (x, y) \in b(\Omega) \qquad (2.115)

Two different choices of D(s) were suggested:

D(s) = \frac{1}{1 + s^2 / \lambda^2} \qquad (2.116)

D(s) = \exp\left( - s^2 / \lambda^2 \right) \qquad (2.117)

An associated flux function can be defined as:

\Phi(s) = s \cdot D(s) \qquad (2.118)

With the choice (2.116), the flux Φ(s) decreases for s > λ, so that Φ'(s) < 0 there.
Therefore, near edges (large gradient) there is backward diffusion that will enhance
these edges.
In the simple one-dimensional diffusion case, one has:

\frac{\partial u}{\partial t} = \frac{\partial}{\partial x} \, \Phi(u_x) = \Phi'(u_x) \, u_{xx} \qquad (2.119)

Figure 2.34 shows the one-dimensional diffusion coefficient and the flux function for
D(s) given by (2.116), with λ = 3.
Extensive research has been devoted to the proposal contained in the Perona-
Malik paper. See [184] for a detailed treatment of the topic, with abundant references.
Also in [184], an anisotropic diffusion was introduced using a tensorial diffusion
coefficient.
The Perona-Malik method can be implemented with several discretization schemes.
Program 2.15 represents an example of implementation that has a fast execution time.
Figure 2.35 shows the result of the method for the denoising of a picture having
salt & pepper noise (this noise has been added using imnoise( )). The noise has been
softened while keeping image features.
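A minimal sketch of the scheme is the following Python fragment (illustrative only, not Program 2.15; it uses the classical four-neighbor discretization with the diffusivity of Eq. (2.116), and the grid size, time step, and parameter values are arbitrary choices):

```python
import numpy as np

def perona_malik(u, lam=3.0, dt=0.15, iters=20):
    """Explicit Perona-Malik sketch with D(s) = 1/(1 + s^2/lambda^2),
    computed on the four nearest-neighbor differences."""
    u = u.astype(float).copy()
    for _ in range(iters):
        # neighbor differences, with replicated borders
        dN = np.vstack([u[:1], u[:-1]]) - u
        dS = np.vstack([u[1:], u[-1:]]) - u
        dW = np.hstack([u[:, :1], u[:, :-1]]) - u
        dE = np.hstack([u[:, 1:], u[:, -1:]]) - u
        g = lambda d: 1.0 / (1.0 + (d / lam)**2)   # edge-stopping D(s)
        u = u + dt * (g(dN)*dN + g(dS)*dS + g(dW)*dW + g(dE)*dE)
    return u

# demo: noisy step edge, smoothed flat regions, preserved edge
rng = np.random.default_rng(2)
f = np.zeros((32, 32)); f[:, 16:] = 10.0
noisy = f + rng.standard_normal(f.shape)
smoothed = perona_malik(noisy)
```

In the flat regions the small differences give D ≈ 1 and the noise is diffused away, while across the edge the large jump gives a small D and the contrast survives.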
Fig. 2.35 Denoising of image with salt & pepper noise, using P-M method
title('Salt&pepper denoising');
subplot(1,2,2)
imshow(uint8(un));
xlabel('denoised image');
A convenient iterative method for finding extrema of convex functions was proposed
by Bregman in 1967, [28]. Later on, in 2005, it was shown by Osher et al. [142]
that this method was very appropriate for image processing (in particular for total
variation applications).
The method is based on the Bregman divergence, and it can be employed in its
basic iterative version, or as split Bregman iteration [95].
Aspects concerning divergence, distances, similarity, etc. have already been consid-
ered in the chapter on data analysis and classification. They are crucial for important
applications, like for instance face recognition (and distinction between faces).
An expression of the Bregman divergence between two points x and x_0 would be the following:

D_J^{p}(x, x_0) = J(x) - J(x_0) - \langle p,\; x - x_0 \rangle, \qquad p \in \partial J(x_0)

where J is a convex function and p is a (sub)gradient of J at x_0.
This concept was introduced by Bregman for differentiable convex functions, and
was given the name Bregman distance by Censor and Lent [46] in 1981.
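As a quick numeric check (our own illustrative sketch), for the differentiable convex function J(x) = ‖x‖₂², with p = ∇J(x₀), the divergence D(x, x₀) = J(x) − J(x₀) − ⟨p, x − x₀⟩ reduces to the squared Euclidean distance ‖x − x₀‖₂²:

```python
import numpy as np

def bregman_div(J, gradJ, x, x0):
    """Bregman divergence of a differentiable convex J between x and x0."""
    return J(x) - J(x0) - np.dot(gradJ(x0), x - x0)

J = lambda x: np.dot(x, x)    # J(x) = ||x||_2^2
gradJ = lambda x: 2.0 * x     # its gradient

x = np.array([1.0, 2.0])
x0 = np.array([3.0, -1.0])
d = bregman_div(J, gradJ, x, x0)
print(d, np.sum((x - x0) ** 2))   # → 13.0 13.0
```

Note that the divergence is not symmetric in general, which is why it is a divergence and not a true distance.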
Figure 2.36 gives a graphical interpretation of this distance D, which sits above the tangent L drawn at x_0.
Recently, some important authors [30, 95] used the following expression:

\min_u \; J(u), \quad \text{subject to} \quad H(u) = 0

where J and H are convex functions defined on \mathbb{R}^n. The associated unconstrained problem is:

\min_u \; J(u) + \lambda\, H(u) \qquad (2.123)

Suppose that H is differentiable. In this case, the Bregman iteration would be:

u^{(k+1)} = \arg\min_u \; D_J^{p^{(k)}}(u,\, u^{(k)}) + \lambda\, H(u)

p^{(k+1)} = p^{(k)} - \lambda\, \nabla H(u^{(k+1)})
For the particular problem of TV based image denoising [142], one could take:

H(u, f) = \frac{1}{2}\, \| A\,u - f \|_2^2 \qquad (2.128)
then the iteration can be expressed in the following simplified form, in which the residual is added back to the data term:

u^{(k+1)} = \arg\min_u \; J(u) + \frac{\lambda}{2}\, \| A\,u - f^{(k)} \|_2^2

f^{(k+1)} = f^{(k)} + f - A\,u^{(k+1)}

For a better approximation a penalty quadratic term could be added, and then:

u^{(k+1)} = \arg\min_u \; J(u) + \langle \nabla H(u^{(k)}) - p^{(k)},\; u \rangle + \frac{\delta}{2}\, \| u - u^{(k)} \|_2^2 \qquad (2.132)
A characteristic aspect of the split Bregman method is that it separates the typical l1 and l2 portions of important image processing formulations. For instance, suppose that the minimization problem is:

\min_{u,\, d} \; \| d \|_1 + H(u) + \frac{\lambda}{2}\, \| d - \Phi(u) \|_2^2 \qquad (2.133)

where the auxiliary variable d stands for \Phi(u).
u^{(k+1)} = \arg\min_u \; H(u) + \frac{\lambda}{2}\, \| d^{(k)} - \Phi(u) - b^{(k)} \|_2^2 \qquad (2.134)

d^{(k+1)} = \arg\min_d \; \| d \|_1 + \frac{\lambda}{2}\, \| d - \Phi(u^{(k+1)}) - b^{(k)} \|_2^2 \qquad (2.135)
As you can see, the method involves two optimization steps and a simple update of b:

b^{(k+1)} = b^{(k)} + \Phi(u^{(k+1)}) - d^{(k+1)} \qquad (2.136)

The first optimization step is differentiable, and so it can be solved by a variety of methods, like Gauss-Seidel or conjugate gradient, or even in the Fourier domain. The second optimization step can be computed by shrinkage, that is:

d_j^{(k+1)} = \text{shrink}\big( \Phi(u)_j + b_j^{(k)},\; 1/\lambda \big) \qquad (2.137)
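The shrink operator of (2.137) is the soft-thresholding function, which takes a couple of lines (a NumPy sketch; the function name follows the text):

```python
import numpy as np

def shrink(x, t):
    """Soft-thresholding: shrink(x, t) = sign(x) * max(|x| - t, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

# entries with magnitude below the threshold are zeroed,
# the rest are pulled toward zero by the threshold
out = shrink(np.array([-3.0, -0.2, 0.5, 2.0]), 1.0)
```

Soft-thresholding is also the proximity operator of the l1 norm, which is why it appears repeatedly in this chapter.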
It has been shown in [95] how to apply the split Bregman method for total variation (TV) image denoising, based on the ROF model. The method is simple and efficient, and it handles two-dimensional variables.
There are two denoising formulations: the anisotropic problem and the isotropic problem. In the anisotropic problem, one has to solve:

\min_u \; \| \nabla_x u \|_1 + \| \nabla_y u \|_1 + \frac{\mu}{2}\, \| u - f \|_2^2 \qquad (2.139)
Let us replace \nabla_x u by d_x and \nabla_y u by d_y. In order to enforce the constraints d_x = \nabla_x u and d_y = \nabla_y u, two penalty terms were added:

\min_u \; \| d_x \|_1 + \| d_y \|_1 + \frac{\mu}{2}\, \| u - f \|_2^2 + \frac{\lambda}{2}\, \| d_x - \nabla_x u \|_2^2 + \frac{\lambda}{2}\, \| d_y - \nabla_y u \|_2^2 \qquad (2.140)
Coming now to the split Bregman algorithm, the first optimization step would be:

u^{(k+1)} = \arg\min_u \; \frac{\mu}{2}\, \| u - f \|_2^2 + \frac{\lambda}{2}\, \| d_x^{(k)} - \nabla_x u - b_x^{(k)} \|_2^2 + \frac{\lambda}{2}\, \| d_y^{(k)} - \nabla_y u - b_y^{(k)} \|_2^2 \qquad (2.141)
Because the system is strictly diagonally dominant, it is recommended in [95] to get the solution with the Gauss-Seidel method:

u_{i,j}^{(k+1)} = G_{i,j}^{(k)} = \frac{\mu}{\mu + 4\lambda}\, f_{i,j} + \frac{\lambda}{\mu + 4\lambda}\, \big( u_{i+1,j}^{(k)} + u_{i-1,j}^{(k)} + u_{i,j+1}^{(k)} + u_{i,j-1}^{(k)} + d_{x,i-1,j}^{(k)} - d_{x,i,j}^{(k)} + d_{y,i,j-1}^{(k)} - d_{y,i,j}^{(k)} - b_{x,i-1,j}^{(k)} + b_{x,i,j}^{(k)} - b_{y,i,j-1}^{(k)} + b_{y,i,j}^{(k)} \big)
b_x^{(k+1)} = b_x^{(k)} + \big( \nabla_x u^{(k+1)} - d_x^{(k+1)} \big) \qquad (2.145)

b_y^{(k+1)} = b_y^{(k)} + \big( \nabla_y u^{(k+1)} - d_y^{(k+1)} \big) \qquad (2.146)
For the isotropic problem, the d variables are obtained with a generalized shrinkage:

d_x^{(k+1)} = \max\big( s^k - 1/\lambda,\; 0 \big)\, \frac{\nabla_x u^{(k)} + b_x^{(k)}}{s^k} \qquad (2.150)

d_y^{(k+1)} = \max\big( s^k - 1/\lambda,\; 0 \big)\, \frac{\nabla_y u^{(k)} + b_y^{(k)}}{s^k} \qquad (2.151)
b_y^{(k+1)} = b_y^{(k)} + \big( \nabla_y u^{(k+1)} - d_y^{(k+1)} \big) \qquad (2.153)

where:

s^k = \sqrt{ \big| \nabla_x u^{(k)} + b_x^{(k)} \big|^2 + \big| \nabla_y u^{(k)} + b_y^{(k)} \big|^2 } \qquad (2.154)

d_x^{(k+1)} = \frac{ s^k \, \big( \nabla_x u^{(k)} + b_x^{(k)} \big) }{ s^k + 1 } \qquad (2.155)

d_y^{(k+1)} = \frac{ s^k \, \big( \nabla_y u^{(k)} + b_y^{(k)} \big) }{ s^k + 1 } \qquad (2.156)
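A compact sketch of the whole anisotropic loop may help to fix ideas. Here the u-subproblem (2.141) is solved approximately with a few Jacobi sweeps of the underlying linear system (the book follows [95] with Gauss-Seidel), then the d variables are shrunk and the b variables receive the Bregman update as in (2.146). This is Python/NumPy rather than the book's MATLAB; periodic boundaries and the parameter values are our own illustrative choices:

```python
import numpy as np

def shrink(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def tv_anisotropic(f, mu=1.0, lam=2.0, niter=60, njac=4):
    """Sketch of anisotropic TV denoising by split Bregman.

    Periodic boundaries; mu, lam and the iteration counts are
    illustrative, not tuned values from the text."""
    u = f.copy()
    dx = np.zeros_like(f); dy = np.zeros_like(f)
    bx = np.zeros_like(f); by = np.zeros_like(f)
    fwd = lambda v, ax: np.roll(v, -1, axis=ax) - v   # forward difference
    for _ in range(niter):
        # u-subproblem: a few Jacobi sweeps of the linear system
        for _ in range(njac):
            nb = (np.roll(u, 1, 0) + np.roll(u, -1, 0) +
                  np.roll(u, 1, 1) + np.roll(u, -1, 1))
            div = (np.roll(dx - bx, 1, 1) - (dx - bx) +
                   np.roll(dy - by, 1, 0) - (dy - by))
            u = (mu * f + lam * (nb + div)) / (mu + 4.0 * lam)
        # d-subproblems: element-wise shrinkage with threshold 1/lam
        dx = shrink(fwd(u, 1) + bx, 1.0 / lam)
        dy = shrink(fwd(u, 0) + by, 1.0 / lam)
        # Bregman updates of the b variables
        bx = bx + (fwd(u, 1) - dx)
        by = by + (fwd(u, 0) - dy)
    return u
```

On a piecewise-constant test image with additive noise, the result is markedly closer to the clean image than the noisy input.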
The study of matrix completion and related matrix recovery problems has originated a fruitful new research area, which is very active and still expanding. This section introduces some fundamental aspects of this area.
In a highly cited article, [40], Candès and Recht introduced a topic of considerable interest: the recovery of a data matrix from a sampling of its entries. A number m of entries are chosen uniformly at random from a matrix M, and the question is whether it is possible to recover the matrix M entirely from these m entries.
The topic was naturally introduced by extension of compressed sensing, in which a
sparse signal is recovered from some samples taken at random. In the case of matrices,
if the matrix M has low rank or approximately low rank, then accurate and even exact
recovery from random sampling is possible by nuclear norm minimization [40].
By the way, it is now convenient to quote a certain set of norms, denoted as Schatten p-norms, for p = 1, 2 or \infty. In particular:

Spectral norm, the largest singular value:

\| X \|_S = \max_i \sigma_i(X) = \| \sigma(X) \|_\infty

Nuclear norm, the sum of the singular values:

\| X \|_* = \sum_{i=1}^{N} \sigma_i(X) = \| \sigma(X) \|_1

Frobenius norm:

\| X \|_F = \Big( \sum_{i=1}^{N} \sum_{j=1}^{M} X_{ij}^2 \Big)^{1/2} = \Big( \sum_{i=1}^{N} \sigma_i^2(X) \Big)^{1/2} = \| \sigma(X) \|_2
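These three norms are just the l∞, l1 and l2 norms of the vector of singular values, which is easy to verify numerically (a NumPy sketch):

```python
import numpy as np

X = np.array([[3.0, 0.0],
              [0.0, 4.0]])
s = np.linalg.svd(X, compute_uv=False)   # singular values: [4, 3]

spectral = s.max()                   # ||sigma(X)||_inf
nuclear = s.sum()                    # ||sigma(X)||_1
frobenius = np.sqrt((s ** 2).sum())  # ||sigma(X)||_2

print(spectral, nuclear, frobenius)  # → 4.0 7.0 5.0
# NumPy's built-in matrix norms agree:
print(np.linalg.norm(X, 2), np.linalg.norm(X, 'nuc'), np.linalg.norm(X, 'fro'))
```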
Using a simple example of an n \times n matrix M that has all entries equal to zero except for the first row, it is noted in [40] that this matrix cannot be recovered from a subset of its entries. There are more pathological cases as well. In general, we need the singular vectors of M to be spread across all coordinates. If this happens, the recovery could be done by solving a nuclear norm minimization problem over the set \Omega of sampled entries:

\text{minimize } \| X \|_*, \quad \text{subject to } X_{ij} = M_{ij}, \ \forall (i,j) \in \Omega
2.6 Matrix Completion and Related Problems 235
It is shown in [40] that most matrices can be recovered provided that the number of samples m obeys:

m \geq C\, n^{6/5}\, r\, \log n \qquad (2.157)
The issues related to matrix completion have attracted a lot of research activity. For
instance, what assumptions are needed to guarantee the matrix recovery? A recent
article on this aspect is [98], which includes important references. See [69] for another
perspective that connects phase transitions and matrix denoising. An extensive work
on fundamental limits and efficient algorithms is [137]. The SVD applied in the first proposed algorithms can entail excessive computational effort for large matrices, and so many improvements or alternatives have been
explored [129, 182]. Some authors have proposed the use of other norms, instead of
the nuclear norm (see [117] and references therein).
It was said in [39] that in real-world applications the measured entries would be corrupted by noise (perhaps outliers, [190]). This observation originated a new type of problem, in which one has to find a decomposition of the observed matrix M into a low-rank matrix L and a sparse matrix S. The problem was recognized as robust PCA in [38] (a highly cited paper), being stated as follows:

\text{minimize } \| L \|_* + \lambda\, \| S \|_1, \quad \text{subject to } L + S = M

Again, the new problem led to a broad range of research efforts, which have found many interesting applications.
Let us build a simple example by adding a low-rank random matrix and a sparse
matrix (just a diagonal matrix). Figure 2.40 shows images corresponding to these
matrices.
The result of adding the previous two matrices is shown in Fig. 2.41.
Now, the problem is to recover L and S from M. One of the methods that can be used is an adaptation of the Douglas-Rachford algorithm for this case. According to [86], it can be formulated as follows:

repeat:

L_e = \big( M + L^{(k)} - S^{(k)} \big)/2, \qquad S_e = \big( M - L^{(k)} + S^{(k)} \big)/2 \qquad (2.158)

L^{(k+1)} = L^{(k)} + t_k \big( \text{shrink}(2 L_e - L^{(k)}) - L_e \big)

S^{(k+1)} = S^{(k)} + t_k \big( \text{soft-threshold}(2 S_e - S^{(k)}) - S_e \big)

until convergence.
The shrinkage corresponds to the proximity operator for the nuclear norm; and
the soft-thresholding corresponds to the proximity operator for the l1 norm.
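Both proximity operators are easy to write down explicitly (a NumPy sketch; the threshold τ is generic and the function names are ours):

```python
import numpy as np

def prox_nuclear(X, tau):
    """Proximity operator of the nuclear norm: soft-threshold the singular values."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def prox_l1(X, tau):
    """Proximity operator of the l1 norm: entry-wise soft-thresholding."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

# shrinking the singular values [5, 2, 0.5] by tau = 1 gives [4, 1, 0]
B = prox_nuclear(np.diag([5.0, 2.0, 0.5]), 1.0)
```

prox_nuclear lowers the rank of its argument when small singular values fall below τ, which is exactly why it promotes low-rank solutions.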
An implementation of this algorithm is provided by Program 2.17. The result is
satisfactory as shown in Fig. 2.42.
Program 2.17 Decomposition into low-rank (L) and sparse (S) matrices
% decomposition into low-rank (L) and sparse (S) matrices
% Douglas-Rachford
% a random matrix of rank r
n=100;
r=8; %rank
L0=randn(n,r)*randn(r,n); % a low-rank matrix
% a sparse (diagonal) matrix
nn=1:100;
S0=diag(0.2*nn,0);
% composite original matrix
M=L0+S0;
% parameter settings
lambda=1;
tk=1;
Th=3; %Threshold
nnL=30; % number of loops
rnx=zeros(nnL,1);
L=zeros(n,n); S=zeros(n,n);
% start the algorithm --------------------------------
for nn=1:nnL,
Le=0.5*(M+L-S); Se=0.5*(M-L+S);
% shrinking---------
aux1=(2*Le)-L;
[U D V]=svd(aux1);
for j=1:n,
D(j,j)=max(D(j,j)-Th,0);
end;
aux=U*D*V';
L=L+(tk*(aux-Le));
% soft_threshold---------
aux1=(2*Se)-S;
aux=sign(aux1).*max(0, abs(aux1)-lambda);
S=S+(tk*(aux-Se));
[u,d,v]=svd(L);
rnx(nn)=sum(diag(d)); %nuclear norm, record
end;
% display ------------------------------
figure(1)
subplot(1,2,1)
imshow(L0,[]);
title('original low-rank matrix')
subplot(1,2,2)
imshow(S0,[]);
title('original sparse matrix')
figure(2)
imshow(M,[]);
title('original composite matrix');
2.6 Matrix Completion and Related Problems 241
figure(3)
subplot(1,2,1)
imshow(L,[]);
title('recovered low-rank matrix')
subplot(1,2,2)
imshow(S,[]);
title('recovered sparse matrix')
A number of methods for matrix decomposition have been proposed. Soon after the publication of [39], the principal component pursuit (PCP) was introduced. In [204] a study of PCP was presented, with mentions of robust PCA and a reference to [38] as a preprint. In general, the preferred methods are based on alternating minimization
schemes, which are reviewed in the introduction of [170]. One of the factors that
promote the popularity of certain methods is the public availability of code [115,
116, 196]. Theoretical aspects on conditions for the recovery of matrices are treated
in [51].
Some illustrative application examples are [148] for alignment of images, [199]
for low-rank image textures, [54] for face recognition based on robust PCA, [113] on
cognitive radio networks, [203] for medical image analysis, [13] on target tracking
(for example a TV camera focusing on a basketball player during the game), [65] on
computer vision, [56] on genotype imputation, or [181] for movie colorization.
2.7 Experiments
This section includes three experiments: the first is an example of 1D signal denoising, while the other two are applications of matrix completion and matrix decomposition to images.
with:

D = \begin{pmatrix} -1 & 1 & & & \\ & -1 & 1 & & \\ & & \ddots & \ddots & \\ & & & -1 & 1 \end{pmatrix} \qquad (2.164)

y = x + n \qquad (2.165)

x_{k+1} = y - D^T z_k \qquad (2.167)

z_{k+1} = \text{clip}\Big( z_k + \frac{1}{\alpha}\, D\, x_{k+1};\ \lambda/2 \Big) \qquad (2.168)
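The iterations (2.167)-(2.168) translate almost line by line into Python/NumPy (a sketch mirroring Program 2.18; the values of α and λ are illustrative):

```python
import numpy as np

def tv1d_denoise(y, lam=0.5, alpha=4.0, niter=100):
    """1D TV denoising by the clip iteration (2.167)-(2.168).

    D is the first-difference matrix; D x is np.diff(x), and D^T z is
    formed explicitly as [-z_0, z_0 - z_1, ..., z_{N-2}]."""
    z = np.zeros(len(y) - 1)
    x = y.copy()
    for _ in range(niter):
        # x_{k+1} = y - D^T z_k
        Dtz = np.concatenate(([-z[0]], -np.diff(z), [z[-1]]))
        x = y - Dtz
        # z_{k+1} = clip(z_k + (1/alpha) D x_{k+1}, lambda/2)
        z = np.clip(z + np.diff(x) / alpha, -lam / 2, lam / 2)
    return x
```

Applied to a noisy step signal, the flat parts are smoothed while the step is retained.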
The corresponding figures show the original clean signal, the signal with added noise, the evolution of J along the iterations (niter), and the denoised signal.
z=zeros(1,N-1);
J=zeros(1,niter);
alpha=3;
lambda=0.5; th=lambda/2;
%start the algorithm ------------------------
for nn=1:niter,
aux=[-z(1) -diff(z) z(end)];
y=x-aux;
aux1=sum(abs(y-x).^2);
aux2=sum(abs(diff(y)));
J(nn)=aux1+(lambda*aux2);
z=z+((1/alpha)*diff(y));
z=max(min(z,th),-th);
end
% display ----------------------
figure(1)
subplot (1,2,1)
plot(a,'k')
axis([0 4000 -3 3]);
title('original clean signal')
subplot(1,2,2)
plot(x,'k')
axis([0 4000 -3 3]);
title('signal with added noise')
figure(2)
subplot(1,2,1)
plot(J,'k')
xlabel('niter')
title('J evolution')
subplot(1,2,2)
plot(y,'k')
axis([0 4000 -3 3]);
title('denoised signal')
A direct example of matrix completion is taking some pixels of a picture at random, and seeing whether it is possible to recover the picture from these samples.
An image with evident redundancies has been chosen, Fig. 2.45. It could be expected that this image has approximately low rank. Anyway, a certain value of the number of samples p was tried.
The sampling and recovery experiment was done with the Program 2.19, which
is just an adaptation of the Program 2.16.
Figure 2.46 shows the evolution of the nuclear norm along iterations.
And Fig. 2.47 shows the good result of the image recovery.
% X update
% R(b-M(Y)) term
O=zeros(L,1);
% accumulating repeated entries
for j=1:p;
ox=ix(j);
O(ox)=O(ox)+(b(j)-Y(ox));
end;
Q=reshape(O,[n n]);
% proxF (indicator function)
X=Y+Q;
% Y update
%proxG (soft thresholding of singular values):
P=(2*X)-Y;
[U,D,V]=svd(P);
for j=1:n,
aux=D(j,j);
if abs(aux)<=gamma,
D(j,j)=0;
else
if aux>gamma, D(j,j)=aux-gamma; end
if aux<-gamma, D(j,j)=aux+gamma; end;
end;
end;
S=U*D*V'; % result of thresholding
Y=Y+ (lambda*(S-X));
% recording
[u,d,v]=svd(X);
rnx(nn)=sum(diag(d)); %nuclear norm
end
%display -----------------------------
% evolution of nuclear norm
figure(1)
plot(rnx,'k');
title('evolution of nuclear norm')
xlabel('niter');
figure(2)
imshow(A,[]);
title('original picture')
figure(3)
imshow(X,[]);
title('reconstructed picture')
Suppose you have a photograph and someone has written some text on it. It would be good to remove that text. In this situation, matrix decomposition could be helpful, as long as the picture has low rank and the text corresponds to a sparse matrix of pixels.
In order to explore this application, a synthetic problem has been fabricated: some text has been added to a wall of bricks, as a crude simulation of graffiti. Figure 2.48 shows the original image.
Since we wanted to explore alternating minimization schemes, a fast scheme proposed in [159] was selected. In this paper, the matrix decomposition problem is treated as:

\text{minimize } \Big( \frac{1}{2}\, \| L + S - M \|_F^2 + \lambda\, \| S \|_1 \Big), \quad \text{subject to } \text{rank}(L) = t

The problem is solved with the following alternating minimization:

L^{(k+1)} = \arg\min_L \; \| L + S^{(k)} - M \|_F, \quad \text{subject to } \text{rank}(L) = t

S^{(k+1)} = \arg\min_S \; \Big( \frac{1}{2}\, \| L^{(k+1)} + S - M \|_F^2 + \lambda\, \| S \|_1 \Big) \qquad (2.170)
The first sub-problem is solved with a partial SVD of M - S^{(k)}. Only t components of the SVD are selected (corresponding to the t largest singular values). This is done with the lansvd() routine included in the PROPACK library [114], which is commonly found in the publicly available codes.
The second sub-problem is an element-wise shrinkage of M - L^{(k+1)}.
Program 2.20 Decomposition into low-rank (L) and sparse (S) matrices
% decomposition into low-rank (L) and sparse (S) matrices
% Alternating Minimization
% load image
figu=imread('wall.jpg'); %read picture
F=double(figu);
n=380;
M=F(1:n,1:n); %crop;
aux=mean(mean(M));
M=M-aux;
% parameter settings
lambda=0.5;
lambF=1.01;
Th=0.01; %Threshold
rank0=1; %initial rank guess
irk=1; %for rank increments
nnL=50; % number of loops
rank=rank0; %current rank
% start the algorithm --------------------------------
%
[UL SL VL] = lansvd(M, rank, 'L'); %partial SVD
L1=UL*SL*VL'; %initial low-rank approximation
aux=M-L1;
S1= sign(aux).*max(0,abs(aux)-lambda); %shrinkage
for nn=2:nnL,
if irk==1,
lambda = lambda * lambF; % lambda is modified in each iteration
rank = rank + irk; % rank is increased " " "
end;
[UL SL VL] = lansvd(M-S1, rank, 'L'); %partial SVD
L1=UL*SL*VL'; %current low-rank approximation
aux=M-L1;
S1= sign(aux).*max(0,abs(aux)-lambda); %shrinkage
% change rank increment when appropriate
vv=diag(SL);
rho=vv(end)/sum(vv(1:end-1));
if rho<Th,
irk=0;
else
irk=1;
end;
end;
% display ------------------------------
figure(1)
imshow(M,[]);
title('original picture')
figure(2)
subplot(1,2,1)
imshow(L1,[]);
title('low-rank component')
subplot(1,2,2)
imshow(S1,[]);
title('sparse component')
2.8 Resources
The file exchange web site of Mathworks has several MATLAB programs of interest.
Also, some of the methods introduced in this chapter are implemented in some
routines of the MATLAB Statistics Toolbox.
2.8.1 MATLAB
2.8.1.1 Toolboxes
l1-MAGIC:
http://users.ece.gatech.edu/~justin/l1magic/
SparseLab (Stanford University):
http://sparselab.stanford.edu/
SALSA:
http://cascais.lx.it.pt/~mafonso/salsa.html
TwIST:
http://www.lx.it.pt/~bioucas/TwIST/TwIST.htm
SpaRSA:
http://www.lx.it.pt/~mtf/SpaRSA/
MATLAB scripts for ADMM (Stanford University):
http://www.web.stanford.edu/~boyd/papers/admm/
YALL1 (Rice University):
http://yall1.blogs.rice.edu/
Beginner's code for CS (A. Weinstein):
http://control.mines.edu/mediawiki/upload/f/f4/Beginners_code.pdf
MATLAB script for Chan-Vese segmentation (Fields Institute):
http://www.math.ucla.edu/~wittman/Fields/cv.m
Low-Rank Matrix Recovery and Completion via Convex Optimization:
http://perception.csl.illinois.edu/matrix-rank/home.html
2.8.2 Internet
Dave Donoho:
http://www-stat.stanford.edu/~donoho
Emmanuel Candès (software):
http://statweb.stanford.edu/~candes/software.html
Michael Elad:
www.cs.technion.ac.il/~elad/index.html
A. Rakotomamonjy:
http://asi.insa-rouen.fr/enseignants/~arakoto/
Mark Schmidt:
http://www.cs.ubc.ca/~schmidtm/
Mark A. Davenport (CoSaMP):
http://users.ece.gatech.edu/~mdavenport/software/
SeDuMi (Lehigh University):
http://sedumi.ie.lehigh.edu/
Tianyi Zhou (CS reconstruction algorithms):
https://tianyizhou.wordpress.com/2010/08/23/compressed-sensing-review-1-reconstruction-algorithms/
CS Audio Demonstration:
http://sunbeam.ece.wisc.edu/csaudio/
EPFL Signal Processing Lab (Lausanne):
http://lts2www.epfl.ch/people/gilles/softwares
Bio Imaging & Signal Processing Lab.:
http://bispl.weebly.com/software.html
J. Huang:
http://ranger.uta.edu/~huang/index.html
Douglas-Rachford and projection methods:
http://carma.newcastle.edu.au/DRmethods/
Bamdev Mishra (Riemannian matrix completion):
https://sites.google.com/site/bamdevm/codes/qgeommc
Principal Component Pursuit papers and code:
http://investigacion.pucp.edu.pe/grupos/gpsdi/publicaciones-2/
Thierry Bouwmans (surveys):
https://sites.google.com/site/thierrybouwmans/recherche---background-subtraction---survey
Numerical-tours:
http://www.numerical-tours.com/links/
Fast l1 Minimization Algorithms:
http://www.eecs.berkeley.edu/~yang/software/l1benchmark/
Compressive sensing:
https://sites.google.com/site/igorcarron2/compressivesensing2.0
The advanced matrix factorization jungle:
https://sites.google.com/site/igorcarron2/matrixfactorizations
Compressive Sensing: The Big Picture:
https://sites.google.com/site/igorcarron2/cs
Matrix completion solvers:
http://www.ugcs.caltech.edu/~srbecker/wiki/Category:Matrix{_}Completion{_}Solvers
Research in Computational Science:
http://www.csee.wvu.edu/~xinl/source.html
References
38. E.J. Candès, X. Li, Y. Ma, J. Wright, Robust principal component analysis? J. ACM (JACM) 58(3), 11 (2011)
39. E.J. Candès, Y. Plan, Matrix completion with noise. Proc. IEEE 98(6), 925–936 (2010)
40. E.J. Candès, B. Recht, Exact matrix completion via convex optimization. Found. Comput. Math. 9(6), 717–772 (2009)
41. E.J. Candès, T. Tao, Near-optimal signal recovery from random projections: Universal encoding strategies? IEEE T. Inf. Theory 52(12), 5406–5425 (2006)
42. E.J. Candès, M.B. Wakin, An introduction to compressive sampling. IEEE Signal Process. Mag. 21–30 (2008)
43. S. Cao, Q. Wang, Y. Yuan, J. Yu, Anomaly event detection method based on compressive sensing and iteration in wireless sensor networks. J. Netw. 9(3), 711–718 (2014)
44. C. Caramanis, S. Sanghavi, Large Scale Optimization (The University of Texas at Austin, Lecture 24 of EE381V Course, 2012). http://users.ece.utexas.edu/~sanghavi/courses/scribed_notes/Lecture_24_Scribe_Notes.pdf
45. V. Caselles, A. Chambolle, M. Novaga, Total variation in imaging, in Handbook of Mathematical Methods in Imaging, pp. 1016–1057 (Springer Verlag, 2011)
46. Y. Censor, A. Lent, An iterative row-action method for interval convex programming. J. Optim. Theory Appl. 34(3), 321–353 (1981)
47. A. Chambolle, P.L. Lions, Image recovery via total variation minimization and related problems. Numer. Math. 76(2), 167–188 (1997)
48. T. Chan, S. Esedoglu, F. Park, A. Yip, Recent developments in total variation image restoration. Math. Models Comput. Vision 17 (2005)
49. T.F. Chan, S. Esedoglu, Aspects of total variation regularized L1 function approximation. SIAM J. Appl. Math. 65(5), 1817–1837 (2005)
50. T.F. Chan, S. Esedoglu, F.E. Park, Image decomposition combining staircase reduction and texture extraction. J. Visual Commun. Image Represent. 18(6), 464–486 (2007)
51. V. Chandrasekaran, S. Sanghavi, P.A. Parrilo, A.S. Willsky, Rank-sparsity incoherence for matrix decomposition. SIAM J. Optim. 21(2), 572–596 (2011)
52. P. Chatterjee, P. Milanfar, Denoising using the K-SVD Method (EE 264 Course, University of California at Santa Cruz, 2007). https://users.soe.ucsc.edu/~priyam/ksvd_report.pdf
53. K.M. Cheman, Optimization techniques for solving basis pursuit problems. Master's thesis, North Carolina State Univ., Raleigh, NC, USA, 2006
54. F. Chen, C.C.P. Wei, Y.C. Wang, Low-rank matrix recovery with structural incoherence for robust face recognition, in Proceedings IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2618–2625 (2012)
55. S.S. Chen, D.L. Donoho, M.A. Saunders, Atomic decomposition by basis pursuit. SIAM J. Sci. Comput. 20(1), 33–61 (1998)
56. E.C. Chi, H. Zhou, G.K. Chen, D.O. Del Vecchyo, K. Lange, Genotype imputation via matrix completion. Genome Res. 23(3), 509–518 (2013)
57. M.G. Christensen, S.H. Jensen, On compressed sensing and its application to speech and audio signals, in Proceedings 43rd IEEE Asilomar Conference on Signals, Systems and Computers, pp. 356–360 (2009)
58. I. Cimrak, Analysis of the bounded variation and the G regularization for nonlinear inverse problems. Math. Meth. Appl. Sci. 33(9), 1102–1111 (2010)
59. A. Cohen, W. Dahmen, I. Daubechies, R. DeVore, Harmonic analysis of the space BV. Revista Matematica Iberoamericana 19(1), 235–262 (2003)
60. R. Coifman, F. Geshwind, Y. Meyer, Noiselets. Appl. Comput. Harmonic Anal. 10, 27–44 (2001)
61. P.L. Combettes, J.C. Pesquet, Proximal splitting methods in signal processing, in Fixed-point Algorithms for Inverse Problems in Science and Engineering, pp. 185–212 (Springer, 2011)
62. S.B. Damelin, W. Miller Jr., The Mathematics of Signal Processing (Cambridge University Press, 2012)
63. I. Daubechies, M. Fornasier, I. Loris, Accelerated projected gradient method for linear inverse problems with sparsity constraints. J. Fourier Anal. Appl. 14(5–6), 764–792 (2008)
64. M.A. Davenport, M.F. Duarte, Y.C. Eldar, G. Kutyniok, Introduction to compressed sensing, in Compressed Sensing, eds. by Y.C. Eldar, G. Kutyniok (Cambridge University Press, 2013)
65. F. De la Torre, M.J. Black, Robust principal component analysis for computer vision, in Proceedings Eighth IEEE International Conference on Computer Vision (ICCV), vol. 1, pp. 362–369 (2001)
66. G. Dogan, P. Morin, R.H. Nochetto, A variational shape optimization approach for image segmentation with a Mumford-Shah functional. SIAM J. Sci. Comput. 30(6), 3028–3049 (2008)
67. D. Donoho, J. Tanner, Observed universality of phase transitions in high-dimensional geometry, with implications for modern data analysis and signal processing. Philos. Trans. Roy. Soc. A: Math. Phys. Eng. Sci. 367(1906), 4273–4293 (2009)
68. D.L. Donoho, Compressed sensing. IEEE T. Inf. Theory 52(4), 1289–1306 (2006)
69. D.L. Donoho, M. Gavish, A. Montanari, The phase transition of matrix recovery from gaussian measurements matches the minimax MSE of matrix denoising. Proc. Natl. Acad. Sci. 110(21), 8405–8410 (2013)
70. D.L. Donoho, J. Tanner, Precise undersampling theorems. Proc. IEEE 98(6), 913–924 (2010)
71. D.L. Donoho, Y. Tsaig, Fast solution of L1-norm minimization problems when the solution may be sparse. IEEE T. Inf. Theory 54(11), 4789–4812 (2006)
72. I. Drori, D.L. Donoho, Solution of L1 minimization problems by LARS/homotopy methods, in Proceedings IEEE International Conference Acoustics, Speech and Signal Processing (ICASSP), vol. 3 (2006)
73. M.F. Duarte, Y.C. Eldar, Structured compressed sensing: from theory to applications. IEEE Trans. Signal Process. 59(9), 4053–4085 (2011)
74. J. Eckstein, Splitting methods for monotone operators with applications to parallel optimization. Ph.D. thesis, MIT, 1989
75. B. Efron, T. Hastie, I. Johnstone, R. Tibshirani, Least angle regression. Ann. Stat. 32(2), 407–451 (2004)
76. M. Elad, Sparse and Redundant Representations (Springer Verlag, 2010)
77. M. Elad, Sparse and redundant representation modeling: what next? IEEE Signal Process. Lett. 19(12), 922–928 (2012)
78. M. Elad, M. Aharon, Image denoising via sparse and redundant representations over learned dictionaries. IEEE T. Image Process. 15(12), 3736–3745 (2006)
79. M. Elad, M.A. Figueiredo, Y. Ma, On the role of sparse and redundant representations in image processing. Proc. IEEE 98(6), 972–982 (2010)
80. J. Ender, A brief review of compressive sensing applied to radar, in Proceedings 14th International Radar Symposium (IRS), pp. 3–16 (2013)
81. M.A. Figueiredo, R.D. Nowak, S.J. Wright, Gradient projection for sparse reconstruction: application to compressed sensing and other inverse problems. IEEE J. Sel. Topics Signal Process. 1(4), 586–597 (2007)
82. S. Foucart, H. Rauhut, A Mathematical Introduction to Compressive Sensing (Birkhäuser, 2010)
83. K. Fountoulakis, J. Gondzio, P. Zhlobich, Matrix-free interior point method for compressed sensing problems. Math. Program. Comput. 1–31 (2012)
84. J. Friedman, T. Hastie, R. Tibshirani, A note on the group Lasso and a sparse group Lasso (Dept. Statistics, Stanford University, 2010). arXiv preprint arXiv:1001.0736
85. L. Gan, Block compressed sensing of natural images, in Proceedings IEEE International Conference Digital Signal Processing, pp. 403–406 (2007)
86. S. Gandy, I. Yamada, Convex optimization techniques for the efficient recovery of a sparsely corrupted low-rank matrix. J. Math-for-Indus. 2(5), 147–156 (2010)
87. V. Ganesan, A study of compressive sensing for application to structural health monitoring. Master's thesis, University of Central Florida, 2014
88. J. Gao, Q. Shi, T.S. Caetano, Dimensionality reduction via compressive sensing. Pattern Recogn. Lett. 33(9), 1163–1170 (2012)
89. P. Getreuer, Chan-Vese segmentation. Image Processing On Line (2012)
90. P. Getreuer, Rudin-Osher-Fatemi total variation denoising using split Bregman. Image Processing On Line 10 (2012)
91. J.R. Gilbert, C. Moler, R. Schreiber, Sparse matrices in MATLAB: design and implementation. SIAM J. Matrix Anal. Appl. 13(1), 333–356 (1992)
92. J. Gilles, Noisy image decomposition: a new structure, texture and noise model based on local adaptivity. J. Math. Imag. Vision 28(3), 285–295 (2007)
93. J. Gilles, Image decomposition: theory, numerical schemes, and performance evaluation. Adv. Imag. Electron Phys. 158, 89–137 (2009)
94. T. Goldstein, B. O'Donoghue, S. Setzer, R. Baraniuk, Fast alternating direction optimization methods. SIAM J. Imag. Sci. 7(3), 1588–1623 (2014)
95. T. Goldstein, S. Osher, The split Bregman method for L1 regularized problems. SIAM J. Imag. Sci. 2(2), 323–343 (2009)
96. Y. Gousseau, J.M. Morel, Are natural images of bounded variation? SIAM J. Math. Anal. 33(3), 634–648 (2001)
97. X. Guo, S. Li, X. Cao, Motion matters: a novel framework for compressing surveillance videos, in Proceedings 21st ACM International Conference on Multimedia, pp. 549–552 (2013)
98. M. Hardt, R. Meka, P. Raghavendra, B. Weitz, Computational Limits for Matrix Completion (IBM Research Almaden, 2014). arXiv preprint arXiv:1402.2331
99. J. Haupt, W.U. Bajwa, M. Rabbat, R. Nowak, Compressed sensing for networked data. IEEE Signal Process. Mag. 25(2), 92–101 (2008)
100. K. Hayashi, M. Nagahara, T. Tanaka, A user's guide to compressed sensing for communications systems. IEICE Trans. Commun. 96(3), 685–712 (2013)
101. A.E. Hoerl, R.W. Kennard, Ridge regression: biased estimation for nonorthogonal problems. Technometrics 12(1), 55–67 (1970)
102. H. Huang, S. Misra, W. Tang, H. Barani, H. Al-Azzawi, Applications of Compressed Sensing in Communications Networks (New Mexico State University, USA, 2013). arXiv preprint arXiv:1305.3002
103. Y. Huang, J.L. Beck, S. Wu, H. Li, Robust Bayesian compressive sensing for signals in structural health monitoring. Comput.-Aid. Civil Infrastruct. Eng. 29(3), 160–179 (2014)
104. H. Jung, K. Sung, K.S. Nayak, E.Y. Kim, J.C. Ye, K-t FOCUSS: a general compressed sensing framework for high resolution dynamic MRI. Magn. Reson. Med. 61(1), 103–116 (2009)
105. O. Kardani, A.V. Lyamin, K. Krabbenhoft, A comparative study of preconditioning techniques for large sparse systems arising in finite element analysis. IAENG Intl. J. Appl. Math. 43(4), 1–9 (2013)
106. S.J. Kim, K. Koh, M. Lustig, S. Boyd, D. Gorinevsky, An interior-point method for large-scale L1-regularized least squares. IEEE J. Sel. Top. Sign. Process. 1(4), 606–617 (2007)
107. M. Kowalski, B. Torrésani, Structured sparsity: from mixed norms to structured shrinkage, in Proceedings SPARS'09 - Signal Processing with Adaptive Sparse Structured Representations (2009)
108. G. Kutyniok, Theory and applications of compressed sensing. GAMM-Mitteilungen 36(1), 79–101 (2013)
109. V. Le Guen, Cartoon + Texture Image Decomposition by the TV-L1 Model. IPOL, Image Processing On Line (2014). http://www.ipol.im/pub/algo/gjmr_line_segment_detector/
110. F. Lenzen, F. Becker, J. Lellmann, Adaptive second-order total variation: an approach aware of slope discontinuities, in Scale Space and Variational Methods in Computer Vision, pp. 61–73 (Springer, 2013)
111. Z. Li, Y. Zhu, H. Zhu, M. Li, Compressive sensing approach to urban traffic sensing, in Proceedings IEEE International Conference Distributed Computing Systems (ICDCS), pp. 889–898 (2011)
112. H. Liebgott, A. Basarab, D. Kouame, O. Bernard, D. Friboulet, Compressive sensing in medical ultrasound, in Proceedings IEEE International Ultrasonics Symposium (IUS), pp. 1–6 (2012)
113. F. Lin, Z. Hu, S. Hou, J. Yu, C. Zhang, N. Guo, K. Currie, Cognitive radio network as wireless sensor network (ii): Security consideration, in Proceedings IEEE National Aerospace and Electronics Conference (NAECON), pp. 324–328 (2011)
258 2 Sparse Representations
114. Z. Lin. Some Software Packages for Partial SVD Computation (School of EECS, Peking
University, 2011). arXiv preprint arXiv:1108.1548
115. Z. Lin, M. Chen, Y. Ma, The Augmented Lagrange Multiplier Method for Exact Recov-
ery of Corrupted Low-rank Matrices (Microsoft Research Asia, 2010). arXiv preprint
arXiv:1009.5055
116. Z. Lin, A. Ganesh, J. Wright, L. Wu, M. Chen, Y. Ma. Fast Convex Optimization Algorithms
for Exact Recovery of a Corrupted Low-rank Matrix (Microsoft Research Asia, 2009). http://
yima.csl.illinois.edu/psfile/rpca_algorithms.pdf
117. D. Liu, T. Zhou, H. Qian, C. Xu, Z. Zhang, A nearly unbiased matrix completion approach, in Machine Learning and Knowledge Discovery in Databases, pp. 210–225 (Springer Verlag, 2013)
118. G. Liu, W. Kang, IDMA-based compressed sensing for ocean monitoring information acquisition with sensor networks. Math. Probl. Eng. 2014, 1–13 (2014)
119. C. Luo, F. Wu, J. Sun, C.W. Chen, Compressive data gathering for large-scale wireless sensor networks, in Proceedings 15th ACM Annual International Conference on Mobile Computing and Networking, pp. 145–156 (2009)
120. M. Lustig, D. Donoho, J.M. Pauly, Sparse MRI: the application of compressed sensing for rapid MR imaging. Magn. Reson. Med. 58(6), 1182–1195 (2007)
121. M. Lysaker, X.C. Tai, Iterative image restoration combining total variation minimization and a second-order functional. Int. J. Comput. Vision 66(1), 5–18 (2006)
122. J. Mairal, M. Elad, G. Sapiro, Sparse representation for color image restoration. IEEE T. Image Process. 17(1), 53–69 (2008)
123. J. Mairal, B. Yu, Complexity Analysis of the Lasso Regularization Path (Department of Statistics, University of California at Berkeley, 2012). arXiv preprint arXiv:1205.0079
124. D. Mascarenas, A. Cattaneo, J. Theiler, C. Farrar, Compressed sensing techniques for detecting damage in structures. Struct. Health Monit. (2013)
125. P. Maurel, J.F. Aujol, G. Peyré, Locally parallel texture modeling. SIAM J. Imag. Sci. 4(1), 413–447 (2011)
126. D. McMorrow, Compressive sensing for DoD sensor systems. Technical report (MITRE Corp, 2012)
127. J. Meng, H. Li, Z. Han, Sparse event detection in wireless sensor networks using compressive sensing, in Proceedings IEEE 43rd Annual Conference Information Sciences and Systems (CISS), pp. 181–185 (2009)
128. Y. Meyer, Oscillating patterns in image processing and nonlinear evolution equations: the fifteenth Dean Jacqueline B. Lewis memorial lectures. AMS Bookstore 22 (2001)
129. M. Michenkova, Numerical Algorithms for Low-rank Matrix Completion Problems (Swiss
Federal Institute of Technology, Zurich, 2011). http://sma.epfl.ch/~anchpcommon/students/
michenkova.pdf
130. D. Mumford, J. Shah, Optimal approximations by piecewise smooth functions and associated variational problems. Comm. Pure Appl. Math. 42(5), 577–685 (1989)
131. S. Mun, J.E. Fowler, Block compressed sensing of images using directional transforms, in Proceedings IEEE International Conference Image Processing (ICIP), pp. 3021–3024 (2009)
132. M. Nagahara, T. Matsuda, K. Hayashi, Compressive sampling for remote control systems. IEICE Trans. Fundam. Electron. Commun. Comput. Sci. 95(4), 713–722 (2012)
133. D. Needell, J.A. Tropp, CoSaMP: iterative signal recovery from incomplete and inaccurate samples. Appl. Comput. Harmon. Anal. 26(3), 301–321 (2009)
134. Y. Nesterov, Smooth minimization of non-smooth functions. Math. Programm. 103(1), 127–152 (2005)
135. M. Nikolova, A variational approach to remove outliers and impulse noise. J. Math. Imag. Vision 20(1–2), 99–120 (2004)
136. R. Nowak, M. Figueiredo, Fast wavelet-based image deconvolution using the EM algorithm, in Proceedings 35th Asilomar Conference Signals, Systems and Computers (2001)
137. S. Oh, Matrix Completion: Fundamental Limits and Efficient Algorithms. Ph.D. thesis, Stanford University, 2010
138. B.A. Olshausen, D.J. Field, Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature 381(6583), 607–609 (1996)
139. B.A. Olshausen, D.J. Field, Sparse coding with an overcomplete basis set: a strategy employed by V1? Vision Res. 37(23), 3311–3325 (1997)
140. B.A. Olshausen, D.J. Field, Sparse coding of sensory inputs. Curr. Opin. Neurobiol. 14(4), 481–487 (2004)
141. M.R. Osborne, B. Presnell, B.A. Turlach, A new approach to variable selection in least squares problems. IMA J. Numer. Anal. 20(3), 389–403 (2000)
142. S. Osher, M. Burger, D. Goldfarb, J. Xu, W. Yin, An iterative regularization method for total variation-based image restoration. Multiscale Model. Simul. 4(2), 460–489 (2005)
143. S. Osher, J.A. Sethian, Fronts propagating with curvature-dependent speed: algorithms based on Hamilton-Jacobi formulations. J. Comput. Phys. 79(1), 12–49 (1988)
144. I. Papusha, Fast Automatic Background Extraction Via Robust PCA (Caltech, 2011). http://
www.cds.caltech.edu/~ipapusha/pdf/robust_pca_apps.pdf
145. N. Parikh, S. Boyd, Proximal algorithms. Found. Trends Optim. 1(3), 123–231 (2013)
146. J.C. Park, B. Song, J.S. Kim, S.H. Park, H.K. Kim, Z. Liu, W.Y. Song, Fast compressed sensing-based CBCT reconstruction using Barzilai-Borwein formulation for application to on-line IGRT. Med. Phys. 39(3), 1207–1217 (2012)
147. S. Patterson, Y.C. Eldar, I. Keidar, Distributed Compressed Sensing for Static and Time-varying Networks (Rensselaer Polytechnic Institute, 2013). arXiv preprint arXiv:1308.6086
148. Y. Peng, A. Ganesh, J. Wright, W. Xu, Y. Ma, RASL: robust alignment by sparse and low-rank decomposition for linearly correlated images. IEEE Trans. Pattern Anal. Mach. Intell. 34(11), 2233–2246 (2012)
149. P. Perona, J. Malik, Scale space and edge detection using anisotropic diffusion, in Proceedings IEEE Computer Society Workshop on Computer Vision, pp. 16–22 (1987)
150. P. Perona, J. Malik, Scale space and edge detection using anisotropic diffusion. IEEE T. Pattern Anal. Mach. Intell. 12, 629–639 (1990)
151. G. Peyre, Matrix Completion with Nuclear Norm Minimization. A Numerical Tour of Sig-
nal Processing (2014). http://gpeyre.github.io/numerical-tours/matlab/sparsity_3_matrix_
completion/
152. T. Pock, D. Cremers, H. Bischof, A. Chambolle, An algorithm for minimizing the Mumford-Shah functional, in Proceedings IEEE 12th International Conference on Computer Vision, pp. 1133–1140 (2009)
153. S. Qaisar, R.M. Bilal, W. Iqbal, M. Naureen, S. Lee, Compressive sensing: from theory to applications, a survey. J. Commun. Netw. 15(5), 443–456 (2013)
154. C. Quinsac, A. Basarab, D. Kouame, Frequency domain compressive sampling for ultrasound imaging. Adv. Acoust. Vib. 2012, 1–16 (2012)
155. I. Ram, M. Elad, I. Cohen, Image processing using smooth ordering of its patches. IEEE T. Image Process. 22(7), 2764–2774 (2013)
156. M.A. Rasmussen, R. Bro, A tutorial on the Lasso approach to sparse modeling. Chemom. Intell. Lab. Syst. 119, 21–31 (2012)
157. H. Rauhut, Compressive sensing and structured random matrices, in Theoretical Foundations and Numerical Methods for Sparse Recovery, ed. by M. Fornasier (Walter de Gruyter, Berlin, 2010)
158. J. Richy, Compressive sensing in medical ultrasonography. Master's thesis, Kungliga Tekniska Högskolan, 2012
159. P. Rodríguez, B. Wohlberg, Fast principal component pursuit via alternating minimization, in Proceedings IEEE International Conference on Image Processing (ICIP), pp. 69–79 (2013)
160. J. Romberg, Imaging via compressive sampling (introduction to compressive sampling and recovery via convex programming). IEEE Signal Process. Magz. 25(2), 14–20 (2008)
161. R. Rubinstein, A.M. Bruckstein, M. Elad, Dictionaries for sparse representation modeling. Proc. IEEE 98(6), 1045–1057 (2010)
162. R. Rubinstein, M. Zibulevsky, M. Elad, Efficient implementation of the K-SVD algorithm
and the Batch-OMP method. Technical report, Department of Computer Science, Technion,
Israel, 2008. Technical CS08
163. L.I. Rudin, S. Osher, E. Fatemi, Nonlinear total variation based noise removal algorithms. Physica D: Nonlinear Phenomena 60(1), 259–268 (1992)
164. L. Ryzhik, Lecture Notes for Math 221 (Stanford University, 2013). http://math.stanford.edu/
~ryzhik/STANFORD/STANF221-13/stanf221-13-notes.pdf
165. M. Saunders, PDCO Primal-dual Interior Methods (CME 338 Lecture Notes 7, Stanford
University, 2013). https://web.stanford.edu/group/SOL/software/pdco/pdco.pdf
166. M. Schmidt, Least squares optimization with L1-norm regularization. Technical report, University of British Columbia, 2005. Project Report
167. I. Selesnick, Total Variation Denoising (an MM Algorithm) (New York University, 2012).
http://eeweb.poly.edu/iselesni/lecture_notes/TVDmm/TVDmm.pdf
168. I.W. Selesnick, Sparse Signal Restoration (New York University, 2010). http://eeweb.poly.
edu/iselesni/lecture_notes/sparse_signal_restoration.pdf
169. I.W. Selesnick, I. Bayram, Total Variation Filtering (New York University, 2010). http://
eeweb.poly.edu/iselesni/lecture_notes/TVDmm/
170. Y. Shen, Z. Wen, Y. Zhang, Augmented Lagrangian alternating direction method for matrix separation based on low-rank factorization. Optim. Meth. Softw. 29(2), 239–263 (2014)
171. E.P. Simoncelli, B.A. Olshausen, Natural image statistics and neural representation. Ann. Rev. Neurosci. 24(1), 1193–1216 (2001)
172. J.L. Starck, M. Elad, D. Donoho, Redundant multiscale transforms and their application for morphological component separation. Adv. Imag. Electron Phys. 132(82), 287–348 (2004)
173. J.L. Starck, Y. Moudden, J. Bobin, M. Elad, D.L. Donoho, Morphological component analysis.
Opt. Photon. 115 (2005)
174. J.L. Starck, F. Murtagh, J.M. Fadili, Sparse Image and Signal Processing (Cambridge University Press, 2010)
175. D. Sundman, Compressed sensing: algorithms and applications. Licentiate thesis, KTH Electrical Engineering, Stockholm, 2012
176. R. Tibshirani, Regression shrinkage and selection via the Lasso. J. Royal Stat. Soc., Series B 58, 267–288 (1996)
177. J.A. Tropp, S.J. Wright, Computational methods for sparse solution of linear inverse problems. Proc. IEEE 98(6), 948–958 (2010)
178. R. Valentine, Image Segmentation with the Mumford Shah Functional (2007). http://
coldstonelabs.org/files/science/math/Intro-MS-Valentine.pdf
179. S.A. Van De Geer, P. Bühlmann, On the conditions used to prove oracle results for the Lasso. Electron. J. Stat. 3, 1360–1392 (2009)
180. L.A. Vese, S.J. Osher, Modeling textures with total variation minimization and oscillating patterns in image processing. J. Sci. Comput. 19(1–3), 553–572 (2003)
181. S. Wang, Z. Zhang, Colorization by matrix completion, in Proceedings 26th AAAI Conference on Artificial Intelligence, pp. 1169–1175 (2012)
182. Z. Wang, M.J. Lai, Z. Lu, J. Ye, Orthogonal Rank-one Matrix Pursuit for Low Rank Matrix Completion (The Biodesign Institute, Arizona State University, 2014). arXiv preprint arXiv:1404.1377
183. S.J. Wei, X.L. Zhang, J. Shi, G. Xiang, Sparse reconstruction for SAR imaging based on compressed sensing. Prog. Electromagn. Res. 109, 63–81 (2010)
184. J. Weickert, Anisotropic Diffusion in Image Processing, vol. 1 (Teubner, Stuttgart, 1998)
185. S. Weisberg, Applied Linear Regression (Wiley, 1980)
186. S.J. Wright, R.D. Nowak, A.T. Figueiredo, Sparse reconstruction by separable approximation. IEEE Trans. Signal Process. 57(7), 2479–2493 (2009)
187. H. Xu, C. Caramanis, S. Mannor, Robust regression and Lasso. IEEE T. Inf. Theory 56(7), 3561–3574 (2010)
188. X. Xu, Online robust principal component analysis for background subtraction: a system evaluation on Toyota car data. Master's thesis, University of Illinois at Urbana-Champaign, 2014
189. Y. Xu, W. Yin, A fast patch-dictionary method for whole-image recovery. Technical report,
UCLA, 2013. CAM13-38
190. M. Yan, Y. Yang, S. Osher, Exact low-rank matrix completion from sparsely corrupted entries via adaptive outlier pursuit. J. Sci. Comput. 56(3), 433–449 (2013)
191. S. Yan, C. Wu, W. Dai, M. Ghanem, Y. Guo, Environmental monitoring via compressive sensing, in Proceedings ACM International Workshop on Knowledge Discovery from Sensor Data, pp. 61–68 (2012)
192. J. Yang, Y. Zhang, Alternating direction algorithms for L1 problems in compressive sensing. SIAM J. Sci. Comput. 33(1), 250–278 (2011)
193. W. Yin, D. Goldfarb, S. Osher, Total variation based image cartoon-texture decomposition.
Technical report, Columbia Univ., Dep. Industrial Eng. & Operation Res., 2005. Rep. CU-
CORC-TR-01
194. W. Yin, D. Goldfarb, S. Osher, A comparison of three total variation based texture extraction models. J. Vis. Commun. Image Represent. 18(3), 240–252 (2007)
195. M. Yuan, Y. Lin, Model selection and estimation in regression with grouped variables. J. Royal Stat. Soc., Series B 68(1), 49–67 (2007)
196. X. Yuan, J. Yang, Sparse and low-rank matrix decomposition via alternating direction methods
(Nanjing University, China, 2009). http://math.nju.edu.cn/~jfyang/files/LRSD-09.pdf
197. B. Zhang, X. Cheng, N. Zhang, Y. Cui, Y. Li, Q. Liang, Sparse target counting and localization in sensor networks based on compressive sensing, in Proceedings IEEE INFOCOM, pp. 2255–2263 (2011)
198. J. Zhang, T. Tan, Brief review of invariant texture analysis methods. Pattern Recogn. 35(3), 735–747 (2002)
199. Z. Zhang, A. Ganesh, X. Liang, Y. Ma, TILT: Transform invariant low-rank textures. Int. J. Comput. Vis. 99(1), 1–24 (2012)
200. C. Zhao, X. Wu, L. Huang, Y. Yao, Y.C. Chang, Compressed sensing based fingerprint identification for wireless transmitters. Sci. World J. 2014, 1–9 (2014)
201. P. Zhao, B. Yu, On model selection consistency of Lasso. J. Mach. Learn. Res. 7, 2541–2563 (2006)
202. S. Zhou, L. Kong, N. Xiu, New bounds for RIC in compressed sensing. J. Oper. Res. Soc. China 1(2), 227–237 (2013)
203. X. Zhou, W. Yu, Low-rank modeling and its applications in medical image analysis. SPIE
Defense Security Sens. 87500V (2013)
204. Z. Zhou, X. Li, J. Wright, E. Candes, Y. Ma, Stable principal component pursuit, in Proceedings IEEE International Symposium on Information Theory (ISIT), pp. 1518–1522 (2010)
205. J. Zhu, D. Baron, Performance regions in compressed sensing from noisy measurements, in Proceedings IEEE 47th Annual Conference on Information Sciences and Systems (CISS), pp. 1–6 (2013)
206. H. Zou, The adaptive Lasso and its oracle properties. J. Am. Stat. Assoc. 101(476), 1418–1429 (2006)
207. H. Zou, T. Hastie, Regularization and variable selection via the elastic net. J. Royal Stat. Soc., Series B 67(2), 301–320 (2005)
Appendix A
Selected Topics of Mathematical
Optimization
A.1 Introduction
This appendix is mostly intended for supporting the chapter on sparse representations.
Therefore, some pertinent optimization topics have been selected for this purpose.
Presently, optimization is a broad discipline covering several types of problems.
The usual way to introduce the subject in teaching is by finding the points
of an objective function where the gradient is zero, and then using second order deriv-
atives (the Hessian matrix) to classify these points: if the Hessian is positive definite,
then it is a local minimum; if the Hessian is negative definite, it is a local maximum;
if indefinite, it is a kind of saddle point. Of course, this analysis corresponds to twice-
differentiable functions. The points where the gradient is zero are called stationary
points.
In real life, there are constraints to be considered. For instance, one might want
to drive a car non-stop over a very long distance, but the fuel tank has a certain size
that poses a limit.
Constrained optimization problems can be converted to unconstrained ones by
using Lagrange multipliers.
When both the objective function and the constraints are linear, it is possible to
apply Linear Programming, which is a very important topic in the world of economy
and industrial production.
As it will be treated in this appendix, there are more general scenarios that can
be considered, based on Quadratic Programming, Convex Programming, Conic Pro-
gramming, Semi-definite Programming, or Non-smooth Optimization.
In some cases, there are non-continuous variables. For instance, in integer programming
there are variables that can only take integer values; for example, the number
of shoes in a container.
There are several computational techniques that can be used for optimization
purposes, like for instance the famous simplex algorithm (see the next section). A
number of iterative methods exist, like Newton's method. Some of these iterative
methods are based on evaluating gradients and Hessians, others just evaluate
gradients, and heuristic methods simply use function values at certain points.
In this appendix, only a subset of optimization topics has been selected. Other
topics have not been mentioned; for instance, a classical branch
of optimization theory which is based on variational approaches. Instead of looking
for stationary points, the question is to find functions (extremals) that optimize a
certain criterion. Important names in this context are Euler, Weierstrass, Jacobi,
Legendre, Mayer, Bolza, Pontryagin, etc. An important algorithm belonging to this
area is Dynamic Programming, by Bellman. Typical applications of this algorithm
are shortest path or critical-path scheduling problems.
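The idea behind dynamic programming, Bellman's principle of optimality, can be illustrated with a tiny shortest-path computation. The following is only a sketch with made-up data (in Python; the book's own programs are in MATLAB): nodes are processed in reverse topological order, and each node stores its optimal cost-to-go.

```python
# Shortest path on a small DAG by dynamic programming (Bellman's principle).
# cost[v] is the optimal cost-to-go from node v to the destination 'D'.
# Graph and edge weights are made up for illustration.
graph = {
    'A': {'B': 2, 'C': 5},
    'B': {'C': 1, 'D': 7},
    'C': {'D': 3},
    'D': {},
}
order = ['D', 'C', 'B', 'A']          # reverse topological order
cost = {'D': 0}
best_next = {}
for v in order[1:]:
    # Bellman recursion: cost(v) = min over successors w of  weight(v,w) + cost(w)
    w, c = min(((w, graph[v][w] + cost[w]) for w in graph[v]), key=lambda t: t[1])
    cost[v], best_next[v] = c, w

# Recover the optimal path from 'A'
path, v = ['A'], 'A'
while v != 'D':
    v = best_next[v]
    path.append(v)
print(cost['A'], path)   # cost 6, path A -> B -> C -> D
```

The backward sweep visits each edge once, which is what makes dynamic programming efficient for shortest-path and scheduling problems.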
The next section, devoted to Linear Programming, serves also for the introduction
of some important concepts and terms.
The topics of this appendix are treated at an introductory level, trying to include
illustrative examples. Some of the many books that could be recommended for background
are [24, 87].
A.2 Linear Programming
The term linear programming is due to G.B. Dantzig, who published the simplex
algorithm in 1947. This algorithm is considered one of the ten most important
algorithms of the twentieth century. Much of the theory on which it is based was
introduced by Kantorovich in 1939. It is also worth mentioning the work of J. von
Neumann on the theory of duality (also in 1947).
Consider the following optimization problem:
Maximize:
z = c1 x1 + c2 x2 + . . . + cn xn (A.1)
Subject to the constraints:

ai1 x1 + ai2 x2 + . . . + ain xn ≤ bi , i = 1, 2, . . . , m

x1 , x2 , . . . , xn ≥ 0
Perhaps the most important thing to do now, in order to solve the problem, is to adopt
the point of view of geometry. Let us start by saying that the following expression:

a1 x1 + a2 x2 + . . . + an xn = b (A.5)

or, in vector notation:

aT x = b (A.6)

defines a hyperplane in the n-dimensional space. Two hyperplanes with the same
normal vector a:

aT x = b1 (A.7)

aT x = b2 (A.8)

are parallel.
A hyperplane separates the n-dimensional space into two half spaces.
In the case of an inequality:
aT x ≤ b (A.9)
this also separates the n-dimensional space into two half spaces; one of which satisfies
the inequality.
The constraints of the linear programming problem define a set of points x inside
an n-dimensional region formed by the intersection of hyperplanes and half spaces.
For example, suppose the following constraints:
3 x1 + 5 x2 ≤ 15
4 x1 + 9 x2 ≤ 36
x1 , x2 ≥ 0
Let us plot in Fig. A.1 the region defined by these constraints. This region is called
the feasible solution set.
Usually the feasible solution region is bounded, in which case it is a polygon (2D),
or a polyhedron (3D), or a polytope (nD).
Now, suppose that the function to maximize is:
z = 2 x1 + 4 x2 (A.10)
subject to the constraints above. Figure A.2 shows some parallel lines obtained for
different values of z.
The key observation is that the maximum of z is attained at a vertex of the feasible
solution region.
Let us say in advance that the simplex algorithm tests adjacent vertices in sequence,
so that at each new vertex the objective function increases, or at least remains
unchanged. Therefore, it is convenient to study vertices and segments in more detail.
A segment between two points x1 and x2 is the set of points x given by:
x = λ x2 + (1 − λ) x1 , 0 ≤ λ ≤ 1 (A.11)
And now, a very important concept: a set C is convex if for all x1 , x2 ∈ C the
segment between x1 and x2 is composed entirely of points of C.
The hyperplanes and the half-spaces are convex. The intersection of a finite num-
ber of convex sets is convex.
A point x ∈ C is an extreme point of C if it is not an interior point of any segment
between two points of C.
A point x ∈ C is an interior point of C if there is a ball around it made entirely
of points of C.
By adding slack variables, all inequalities can be converted to equalities. This
way, the linear programming problem can be compactly expressed as follows:
Maximize:
z = cT x (A.12)
Appendix A: Selected Topics of Mathematical Optimization 267
Subject to:
Ax = b , x ≥ 0 (A.13)
Already one has the following facts. The constraints define a set F of feasible solu-
tions. It is a convex set. The maximum of the objective function takes place on an
extreme point of F.
Now, it can be shown that if there is a set of linearly independent columns of A,
a1 , a2 , . . . , ak (k ≤ m), such that:

x1 a1 + x2 a2 + . . . + xk ak = b , x j ≥ 0 (A.14)

then the point x = (x1 , . . . , xk , 0, . . . , 0) is an extreme point of F.
The basis of a matrix is a maximal linearly independent set of columns of that matrix.
Suppose that a basis of the matrix A has been found, and form the matrix B with the
columns a1 , a2 , . . . , am of this basis. Any other column a j of A can be written in terms
of the basis:
a j = y1 j a1 + y2 j a2 + . . . + ym j am (A.15)
Or, equivalently:
aj = B yj (A.16)
As seen before, there is an extreme point associated to B, and at this point the
objective function has the value z B . Now, we want to substitute one of the columns
of B by another column of A, so that a new basis B′ is obtained, such that z B′ ≥ z B
(z B′ is the value of the objective function at the extreme point associated to B′). There
are two questions: which column to extract from B, and which column to take from A.
Consider a partition of the constraint equations:
A x = [B, N] · [ x_B ; x_N ] = b (A.17)

where x_B contains the components of x associated to the columns of B, and x_N the rest.
Basic feasible solutions of the constraint equations are feasible solutions with no
more than m positive entries. One of these solutions will be such that:
268 Appendix A: Selected Topics of Mathematical Optimization
xN = 0 (A.18)
B xB = b (A.19)
x B1 a1 + x B2 a2 + . . . + x Bm am = b (A.20)
After an algebraic study, the simplex algorithm concludes that the column ar to
extract from B should be such that:
x_Br / y_rj = min_i { x_Bi / y_ij : y_ij > 0 } (A.21)
This was the answer to the first question. For the second question, denote:
z j = Σ_{i=1}^{m} ci yi j (A.22)
The column a j to take from A should be such that it maximizes the following
expression:

( x_Br / y_rj ) · (c j − z j ) (A.23)
In practice, linear programming problems are handled with a specific tableau format,
so the algorithm can be applied in a clear systematic way. Initial solutions are easily
forced via artificial variables.
Let us present an example of using the tableau format [105]. The problem to
solve is:
Maximize:
z = 7 x1 + 5 x2 (A.24)
Subject to:

2 x1 + x2 ≤ 100 (A.25)

4 x1 + 3 x2 ≤ 240 (A.26)

Adding slack variables s1 and s2, the constraints become:

2 x1 + x2 + s1 = 100 (A.27)

4 x1 + 3 x2 + s2 = 240 (A.28)
cj           7     5     0     0
Basis        x1    x2    s1    s2    b
0    s1      2     1     1     0     100
0    s2      4     3     0     1     240
zj           0     0     0     0     0
cj - zj      7     5     0     0
Notice that the initial tableau is filled with just the problem statement: the objective
function and the constraints. If all numbers in the last row were zero or negative, the
optimum would have been reached; but this is not the case, so the algorithm should continue.
Now, the effort concentrates on selecting a pivot. The pivot column is where
c j − z j is maximum; so it is the x1 column. Now, compute 100/2 and 240/4 (the
numbers 100 and 240 belong to the b column; the numbers 2 and 4 belong to the
pivot column). The smallest non-negative result tells us the pivot row, which in this
case is the s1 row. Therefore, s1 abandons the basis, and is substituted by x1. The
value of the pivot is 2. The next tableau is the following:
cj           7     5      0      0
Basis        x1    x2     s1     s2    b
7    x1      1     1/2    1/2    0     50
0    s2      0     1      -2     1     40
zj           7     7/2    7/2    0     350
cj - zj      0     3/2    -7/2   0
The x1 row has been obtained from the s1 row in the previous tableau, dividing it
by the pivot.

Take the kth entry of the new s2 row. It has been obtained from the kth entry of
the old s2 row minus a quantity u k , where u k is the product of the number of the old
s2 row which is below the pivot (4) and the kth entry of the new x1 row. That is:

new s2 (k) = old s2 (k) − 4 · (new x1 (k))

The values of the z j row are obtained from the c j column and the x1 , x2 , s1 , s2 , b
columns as follows: each z j entry is the sum of products of the basis costs (7 and 0)
by the corresponding entries of the column.
The next pivot would be the 1 at the intersection of the x2 column and the s2 row. A
third tableau should then be computed, which turns out to be the final one, containing the optimum.
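The whole tableau procedure just described can be automated in a few lines. The following is only a sketch (Python/NumPy, outside the book's MATLAB programs); it applies exactly the pivoting rules above to the example (A.24), with the tableau columns ordered as x1, x2, s1, s2, b.

```python
import numpy as np

# Simplex tableau for: maximize 7 x1 + 5 x2
# subject to 2 x1 + x2 <= 100, 4 x1 + 3 x2 <= 240 (slacks s1, s2 added).
c = np.array([7.0, 5.0, 0.0, 0.0])            # objective coefficients c_j
T = np.array([[2.0, 1.0, 1.0, 0.0, 100.0],    # constraint rows [A | b]
              [4.0, 3.0, 0.0, 1.0, 240.0]])
basis = [2, 3]                                 # s1, s2 initially in the basis

while True:
    cb = c[basis]                              # costs of the basic variables
    z = cb @ T[:, :-1]                         # z_j row
    reduced = c - z                            # c_j - z_j row
    if np.all(reduced <= 1e-9):                # optimality test
        break
    j = int(np.argmax(reduced))                # pivot column: max c_j - z_j
    ratios = [T[i, -1] / T[i, j] if T[i, j] > 1e-9 else np.inf
              for i in range(T.shape[0])]      # minimum-ratio test
    r = int(np.argmin(ratios))                 # pivot row
    T[r] /= T[r, j]                            # normalize the pivot row
    for i in range(T.shape[0]):                # eliminate pivot column elsewhere
        if i != r:
            T[i] -= T[i, j] * T[r]
    basis[r] = j                               # column j enters the basis

x = np.zeros(4)
x[basis] = T[:, -1]
print(x[:2], c @ x)    # optimum x1 = 30, x2 = 40, z = 410
```

Running it reproduces the two pivots described in the text (first x1 enters and s1 leaves, then x2 enters and s2 leaves) before stopping at the optimal vertex.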
Since this is a brief summary, no space has been devoted to some special situations
that may arise in certain steps of the algorithm. Also, there are revised
versions of the basic algorithm that would deserve a more detailed study. Fortunately,
there is a vast literature on linear programming that the interested reader may easily
reach [29, 40, 112].
A word of caution: the term simplex is also used to designate a set of n + 1 points
in an n-dimensional space. There are iterative optimization methods based on the use
of simplexes. These methods should not be confused with the simplex algorithm just
described.
One of the first papers proposing the use of linear programming for the design of
FIR filters was [90], in 1972.

The main idea for writing the design problem in linear programming format
was to discretize the desired frequency response along a set of frequencies
Ω = {ω1 , ω2 , . . . , ωm }. A dense frequency sampling (not necessarily regularly spaced)
is recommended.
The frequency response of a linear-phase FIR filter can be written as follows:

H (ω) = h0 + 2 Σ_{k=1}^{N} hk cos(ω k) (A.29)
The minimization target could be specified as follows: find the FIR filter coefficients
hk such that δ is minimized, where |E(ω)| < δ for 0 ≤ ω ≤ π, and E(ω) is the
weighted approximation error:

E(ω) = W (ω) (H (ω) − D(ω)) (A.30)

with D(ω) the desired frequency response and W (ω) a positive weighting function.
According to [102], the problem can now be specified as:

Minimize:

δ (A.31)

Subject to:

H (ωk ) − D(ωk ) ≤ δ / W (ωk ) , 1 ≤ k ≤ m (A.32)

−(H (ωk ) − D(ωk )) ≤ δ / W (ωk ) , 1 ≤ k ≤ m (A.33)
More or less connected with this approach, a number of proposals have been made
for using linear programming or other optimization methods for the design of FIR
and IIR filters [28, 91, 113]. In [8], linear programming is applied to the design of
sparse filters.
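Equation (A.29) itself is easy to verify numerically. The following sketch (Python/NumPy, with made-up coefficients; the book's own programs are in MATLAB) evaluates the zero-phase form (A.29) on a dense frequency grid, and compares it with the response computed from the full symmetric impulse response:

```python
import numpy as np

# A symmetric (zero-phase form) FIR filter: coefficients h_0, h_1, ..., h_N
# correspond to the impulse response [h_N, ..., h_1, h_0, h_1, ..., h_N].
h = np.array([0.4, 0.25, 0.05])        # h_0, h_1, h_2 (made-up values)
N = len(h) - 1
w = np.linspace(0, np.pi, 200)         # dense frequency grid

# Frequency response via (A.29): H(w) = h_0 + 2 * sum_k h_k cos(w k)
H = h[0] + 2 * sum(h[k] * np.cos(w * k) for k in range(1, N + 1))

# Same response from the full impulse response (indices n = -N..N)
imp = np.concatenate([h[::-1], h[1:]])  # [h2, h1, h0, h1, h2]
Hfull = np.array([np.sum(imp * np.exp(-1j * wi * np.arange(-N, N + 1)))
                  for wi in w])

print(np.max(np.abs(H - Hfull.real)))  # ~0: the two computations agree
```

Because the impulse response is symmetric, the full response is purely real, which is what allows the LP constraints (A.32)-(A.33) to be written on H(ω) directly.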
Finally, Fig. A.5 shows a case with infinitely many (equally good) solutions, since
the optimal objective function line coincides with a segment joining two vertices.

Just from these first examples it may be clearly sensed that the deep mathematical
analysis of optimization topics requires a solid background in topology.
A.3 Interior Point Methods
Interior point methods became quite popular after 1984, when Karmarkar [60]
announced a fast polynomial-time method for linear programming. In fact, specialists
spoke of an "interior-point revolution" from then on. It also
reached the headlines of newspapers like The New York Times or The Wall Street
Journal.

The basic method is iterative; it starts from an interior point of F and then
uses a modified Newton's method to reach the solution. Along the algorithm steps, a
primal-dual optimization problem is treated. More details of these aspects are given
now, as preliminaries.
A.3.1 Preliminaries
Most of the concepts to be briefly reviewed are classic, so more details are easy to
find in the literature.
Suppose you want to solve the equation:

f (x) = 0 (A.34)

Starting from an initial guess x (0) , Newton's method produces better and better
approximations by linearizing f at the current point. The iteration is:

x (k+1) = x (k) + Δx (k) = x (k) − f (x (k) ) / f ′ (x (k) ) (A.37)
When the method converges, it does so quadratically. Of course, the method can be
effortlessly generalized to n dimensions, using gradients.

Suppose you want to find the minimum of a function g(x) using Newton's
method. Then, you take f (x) = g ′ (x). In the case of n variables, with g(x1 , x2 , . . . , xn ),
the Newton step would be:

Δx = − [∇² g(x)]⁻¹ ∇g(x) (A.38)
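As a quick numerical sketch (Python, with a made-up function; the book's own programs are in MATLAB), the scalar iteration (A.37) can be coded directly. Here it finds the root of f(x) = x² − 2, i.e. √2:

```python
# Newton's method for f(x) = x^2 - 2 (root: sqrt(2)).
# Quadratic convergence: the number of correct digits roughly doubles per step.
f = lambda x: x * x - 2.0
df = lambda x: 2.0 * x          # derivative f'(x)

x = 1.0                         # initial guess
for _ in range(6):
    x = x - f(x) / df(x)        # the Newton step (A.37)

print(x)                        # ~1.41421356...
```

Six iterations are far more than needed; already after four steps the iterate agrees with √2 to machine precision, illustrating the quadratic convergence mentioned above.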
Consider now the constrained problem:

Minimize:

f (x) (A.39)

Subject to:

gi (x) = 0 , i = 1, 2, . . . , m (A.40)
The idea of Lagrange for solving this problem is to use a set of multipliers λi to build
the following function:

L(x, λ) = f (x) − Σ_{i=1}^{m} λi gi (x) (A.41)
And then minimize this function, which is called the Lagrangian function, by solving
the following equation system:

∂L/∂x j = ∂ f /∂x j − Σ_{i=1}^{m} λi (∂gi /∂x j ) = 0 , j = 1, 2, . . . , n (A.42)
274 Appendix A: Selected Topics of Mathematical Optimization
∂L/∂λi = − gi (x) = 0 , i = 1, 2, . . . , m (A.43)
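For a quadratic objective with linear constraints, the system (A.42)-(A.43) is linear and can be solved directly. A minimal sketch (made-up example, in Python): minimize x1² + x2² subject to x1 + x2 = 1, where the stationarity conditions are 2 x1 − λ = 0, 2 x2 − λ = 0, together with the constraint.

```python
import numpy as np

# Minimize x1^2 + x2^2 subject to x1 + x2 = 1, via the Lagrangian system
# (A.42)-(A.43):  2 x1 - lam = 0,  2 x2 - lam = 0,  x1 + x2 = 1.
# Unknowns ordered as (x1, x2, lam):
M = np.array([[2.0, 0.0, -1.0],
              [0.0, 2.0, -1.0],
              [1.0, 1.0,  0.0]])
rhs = np.array([0.0, 0.0, 1.0])

x1, x2, lam = np.linalg.solve(M, rhs)
print(x1, x2, lam)     # 0.5, 0.5, 1.0
```

The constrained minimum is at (1/2, 1/2), with multiplier λ = 1; geometrically, the gradient of the objective is parallel to the gradient of the constraint there.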
Let us take into account inequalities in the constraints, so the problem becomes:
Minimize:
f (x) (A.44)
Subject to:
hi (x) ≤ 0 , i = 1, 2, . . . , m (A.45)
li (x) = 0 , i = 1, 2, . . . , r (A.46)
The Lagrangian now includes two sets of multipliers:

L(x, u, v) = f (x) + Σ_{i=1}^{m} u i hi (x) + Σ_{i=1}^{r} vi li (x) (A.47)

The Lagrange dual function is defined as the infimum of the Lagrangian over x:

q(u, v) = inf_x L(x, u, v) (A.48)

The infimum is used instead of the minimum, because the Lagrangian might not have
a minimum on F.
An important fact is the following: denote as f ∗ the optimal value of the primal problem;
then q(u, v) is always a lower bound of f ∗ . This property is called weak duality.

Then, it would make sense to look for a set of values (u ∗ , v ∗ ) such that the
Lagrangian dual function is maximized. This leads to the Lagrange dual problem:
Maximize:
q(u, v) (A.50)
Subject to:
u ≥ 0 (A.51)
Appendix A: Selected Topics of Mathematical Optimization 275
The dual problem is always convex. The primal and dual optimal values, f ∗ and
q ∗ , always satisfy weak duality: f ∗ ≥ q ∗ .

The difference:

η(x, u, v) = f (x) − q(u, v) (A.52)

is called the duality gap. Strong duality means that the gap at the optimum is zero:
f ∗ = q ∗ . Suppose that strong duality holds, with x ∗ primal optimal and (u ∗ , v ∗ )
dual optimal; then:

f (x ∗ ) = q(u ∗ , v ∗ ) = inf_x ( f (x) + Σ_{i=1}^{m} u i∗ hi (x) + Σ_{i=1}^{r} vi∗ li (x) )
≤ f (x ∗ ) + Σ_{i=1}^{m} u i∗ hi (x ∗ ) ≤ f (x ∗ ) (A.53)
Then: Σ_{i=1}^{m} u i∗ hi (x ∗ ) = 0. This means that:

u i∗ hi (x ∗ ) = 0, i = 1, 2, . . . , m (A.54)

(a property known as complementary slackness).
If x ∗ and (u ∗ , v ∗ ) are primal and dual optimal points with zero duality gap, they
satisfy the Karush-Kuhn-Tucker (KKT) conditions:

hi (x ∗ ) ≤ 0 , i = 1, 2, . . . , m (A.55)

li (x ∗ ) = 0 , i = 1, 2, . . . , r (A.56)

u i∗ ≥ 0 , i = 1, 2, . . . , m (A.57)

u i∗ hi (x ∗ ) = 0 , i = 1, 2, . . . , m (A.58)
∇ f (x ∗ ) + Σ_{i=1}^{m} u i∗ ∇hi (x ∗ ) + Σ_{i=1}^{r} vi∗ ∇li (x ∗ ) = 0 (A.59)
Conversely, if the problem is convex and x ∗ , u ∗ , v ∗ satisfy the KKT conditions, then
these values are (primal, dual) optimal.
For some time, the above conditions were named KT conditions, since they appeared in
a publication of H.W. Kuhn and A.W. Tucker in 1951. Later on, it was discovered
that they had already been stated in the unpublished Master's thesis of W. Karush, in 1939.
A.3.1.5 Examples
The next two examples were taken from [13]. The first example is a knapsack problem:

Minimize: z = 3 x1 + 4 x2 + 5 x3 + 5 x4
Subject to: 7 x1 + 5 x2 + 4 x3 + 3 x4 ≥ 17 ; xi ∈ {0, 1}

The Lagrangian would be:

L(x, λ) = (3 − 7λ) x1 + (4 − 5λ) x2 + (5 − 4λ) x3 + (5 − 3λ) x4 + 17 λ

The largest value of L would be: L(5/3) = 12 + (5/3). This is the best (highest)
lower bound of the optimization problem. Actually, the solution of the relaxed problem
(xi ∈ [0, 1]) is z = 12 + 5/3, at the point x = (1, 1, 1, 1/3). Therefore, in this case the best lower
bound was equal to the optimum value.
The second example considers a nonlinear programming problem:
Minimize: z = x² + 2 y²
Subject to:

x + y ≥ 3
y − x² ≥ 1
The Lagrangian is:

L(x, λ) = x² + 2 y² − λ1 (x + y − 3) − λ2 (y − x² − 1)

The KKT conditions are:

x + y ≥ 3
y − x² ≥ 1

λ1 (x + y − 3) = 0
λ2 (y − x² − 1) = 0

∂L/∂x = 2 x − λ1 + 2 λ2 x = 0
∂L/∂y = 4 y − λ1 − λ2 = 0

λ1 , λ2 ≥ 0

Assuming that both constraints are active, solving:

x + y − 3 = 0
y − x² − 1 = 0
gives two possible solutions for (x, y): (−2, 5) or (1, 2), of which only (1, 2) can
satisfy the other equations. Therefore, the optimum is found at the point (1, 2), where
z = 9.
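The multiplier values at the optimum can be checked numerically. A small sketch (Python): with both constraints active at (1, 2), the two stationarity equations form a linear system in λ1, λ2.

```python
import numpy as np

# At the candidate point (x, y) = (1, 2), both constraints are active.
# Stationarity gives a linear system for the multipliers:
#   2x - lam1 + 2*lam2*x = 0
#   4y - lam1 -   lam2   = 0
x, y = 1.0, 2.0
M = np.array([[-1.0, 2.0 * x],
              [-1.0, -1.0]])
rhs = np.array([-2.0 * x, -4.0 * y])
lam1, lam2 = np.linalg.solve(M, rhs)

z = x ** 2 + 2 * y ** 2
print(lam1, lam2, z)   # lam1 = 6, lam2 = 2, z = 9 (multipliers nonnegative)
```

Both multipliers come out nonnegative, so all the KKT conditions hold at (1, 2); repeating the same computation at (−2, 5) gives a negative multiplier, which is why that candidate is discarded.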
As a simple example of the (logarithmic) barrier function approach, consider the
minimization of (x1 + 1)² + (x2 + 1)² subject to x1 , x2 ≥ 0, using the barrier function:

B(x | μ) = (x1 + 1)² + (x2 + 1)² − μ (log x1 + log x2) (A.60)

Setting the partial derivatives to zero:

∂B/∂x1 = 2 (x1 + 1) − μ/x1 = 0 (A.61)

∂B/∂x2 = 2 (x2 + 1) − μ/x2 = 0 (A.62)

the positive solution is:

x1 (μ) = x2 (μ) = − 1/2 + (1/2) √(1 + 2 μ) (A.63)
Notice that the solution tends to (0, 0) as μ → 0. This is a general result of the barrier
function method [37]. In iterative schemes, there would be a sequence of solutions
x ∗ (μ1 ), x ∗ (μ2 ), . . . , x ∗ (μn ) as μ → 0. This sequence is called the central trajectory
or the central path.

In particular, in the case of linear programming, the optimal solution will be
located at the boundary of F. The central path would be a series of interior points,
tending to this optimal solution at the boundary.
In the general case of a B(x | μ) to be minimized, the set of equations ∂B/∂xi =
0, i = 1, . . . , n, would be written, and then one could use Newton's method to
solve them.
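The central path can be traced numerically. A sketch (Python; it assumes, as in the example above, a per-coordinate objective (x + 1)² with the barrier −μ log x, so the stationarity equation has a closed-form positive root):

```python
import math

# Central path for min (x+1)^2 per coordinate, with barrier -mu*log(x):
# B'(x) = 2(x+1) - mu/x = 0  has the positive root
#   x(mu) = -1/2 + (1/2) * sqrt(1 + 2*mu).
# As mu -> 0 the barrier minimizer tends to the constrained optimum x = 0.
def x_of_mu(mu):
    return -0.5 + 0.5 * math.sqrt(1.0 + 2.0 * mu)

path = [x_of_mu(mu) for mu in (1.0, 0.1, 0.01, 0.001)]
print(path)    # decreasing sequence approaching 0

# Check each point really solves the barrier stationarity equation
for mu in (1.0, 0.1, 0.01):
    x = x_of_mu(mu)
    assert abs(2 * (x + 1) - mu / x) < 1e-9
```

The printed sequence is exactly the central path of this toy problem: strictly interior points that approach the boundary optimum as μ shrinks.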
The barrier function technique for linear programming was introduced in 1967,
in the book [37], with the subtitle: Sequential Unconstrained Minimization Techniques
(also known as SUMT).
Let us apply the mechanisms already introduced to the linear programming problem.
The primal problem is:
Minimize cT x subject to A x = b , x ≥ 0.
The dual problem would be:
Maximize bT λ subject to AT λ + s = c , s ≥ 0.
Adding logarithmic barrier terms, the Lagrangian of the primal problem is:

L p (x, λ) = cT x − μ Σ_{i=1}^{n} log(xi ) − λT (Ax − b) (A.64)

and the Lagrangian of the dual problem is:

L d (x, λ, s) = bT λ + μ Σ_{i=1}^{n} log(si ) − x T (AT λ + s − c) (A.65)
After setting the derivatives of the Lagrangians to zero, one obtains just three
equations:

Ax = b (A.66)

AT λ + s = c (A.67)

xi si = μ , i = 1, . . . , n (A.68)

(with x ≥ 0, s ≥ 0)
In order to simplify the notation, the following diagonal matrices will be used:
X = diag{x1 , x2 , . . . , xn } (A.69)
S = diag{s1 , s2 , . . . , sn } (A.70)
X S e = μ e (A.71)
where: eT = [1, 1, . . . , 1]
Hence, one could apply the Newtons method to solve the following equations:
Ax − b = 0 (A.72)

AT λ + s − c = 0 (A.73)

X S e − μ e = 0 (A.74)
In the case of f(x) = 0 (a vector function f of several variables), the Newton step
can be computed from:

J (x(k) ) Δx = − f(x(k) ) (A.75)
After obtaining the Jacobian of the three equations given before, one has:

[ A  0   0 ] [Δx]   [    0     ]
[ 0  Aᵀ  I ] [Δλ] = [    0     ]   (A.76)
[ S  0   X ] [Δs]   [ μe − XSe ]
This equation gives a Newton step. However, the actual next iterate to be applied would be:

(x, λ, s) + α (Δx, Δλ, Δs) (A.77)

A potential function can be used to monitor progress:

Φ(x, s) = ρ log(xᵀs) − Σ(i=1..n) log(xi si) (A.78)

with ρ > n.
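A minimal Python sketch of one damped primal-dual Newton iteration, built directly from the system (A.76) and the update (A.77), is given below. The tiny LP, the centering factor 0.1, and the 0.9 step damping are illustrative choices, not from the book.

```python
import numpy as np

def pd_newton_step(A, b, c, x, lam, s, mu):
    """One Newton step of the primal-dual system (A.76)."""
    m, n = A.shape
    X, S, e = np.diag(x), np.diag(s), np.ones(n)
    # Assemble the block matrix [A 0 0; 0 A^T I; S 0 X]
    K = np.zeros((2 * n + m, 2 * n + m))
    K[:m, :n] = A
    K[m:m + n, n:n + m] = A.T
    K[m:m + n, n + m:] = np.eye(n)
    K[m + n:, :n] = S
    K[m + n:, n + m:] = X
    rhs = np.concatenate([b - A @ x,            # primal residual
                          c - A.T @ lam - s,    # dual residual
                          mu * e - X @ S @ e])  # centering residual
    d = np.linalg.solve(K, rhs)
    return d[:n], d[n:n + m], d[n + m:]

# Illustrative LP: min x1 + 2*x2  s.t.  x1 + x2 = 2, x >= 0  (optimum at (2, 0))
A = np.array([[1.0, 1.0]]); b = np.array([2.0]); c = np.array([1.0, 2.0])
x = np.array([1.0, 1.0]); lam = np.zeros(1); s = np.array([1.0, 2.0])
for _ in range(20):
    mu = 0.1 * (x @ s) / 2            # shrink the centering parameter
    dx, dlam, ds = pd_newton_step(A, b, c, x, lam, s, mu)
    alpha = 1.0                       # damped step (A.77) keeping x, s > 0
    for v, dv in ((x, dx), (s, ds)):
        neg = dv < 0
        if neg.any():
            alpha = min(alpha, 0.9 * np.min(-v[neg] / dv[neg]))
    x, lam, s = x + alpha * dx, lam + alpha * dlam, s + alpha * ds
print(x)  # approaches the optimum (2, 0)
```

Note how the primal and dual residuals stay at zero once a feasible point is reached; only the complementarity products xi·si are driven to μ and then toward zero.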
Karmarkar's algorithm is a potential-reduction algorithm, based on the following potential function:

Φ(x) = ρ log(cᵀx − Z) − Σ(i=1..n) log(xi) (A.79)

where Z is a lower bound of the optimal objective value.
It has been noticed in many applications that interior-point algorithms need only a few tens of iterations to obtain good approximations of the optimum, which is sufficient to terminate the iteration process.
An extensive academic exposition of the interior point methodology is provided
by [44]. Other publications of interest would be [49, 63, 64], the book [93] and the
tutorial [20].
∇z = Q x + c (A.82)

x* = −Q⁻¹ c (A.83)

[Figure: contours of the quadratic objective on the (x1, x2) plane, with the minimum marked M]

Ax = 0 (A.84)
To solve the optimization problem, the following simple method could be applied.
First, a change of variables:
Z y = x (A.85)

where the columns of Z span the null space of A. Then, the reduced system to solve is:

ZᵀQ Z y = −Zᵀc
In the equality-constrained case, the Lagrangian is:

L(x, λ) = (1/2) xᵀQ x + cᵀx + λᵀ(Ax − b) (A.86)

Taking partial derivatives, one obtains the following equations:

Q x + c + Aᵀλ = 0 (A.87)
Ax = b (A.88)
Equations (A.87) and (A.88) can be put in matrix form:

[ Q  Aᵀ ] [x]   [ −c ]
[ A  0  ] [λ] = [  b ]   (A.89)

The matrix at the left-hand side of (A.89) is called the KKT matrix. Assume that A has full row rank, m < n, and that the reduced Hessian is positive definite; then the KKT matrix is nonsingular, and so the equations (A.89) have a unique solution, which is the optimum.
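Solving the KKT system is a one-liner with a linear algebra library. The sketch below (Python/NumPy; the example problem is chosen here for illustration) assembles the KKT matrix and recovers both the optimum and the multiplier.

```python
import numpy as np

def eq_qp(Q, c, A, b):
    """Solve min 1/2 x'Qx + c'x  s.t.  Ax = b  via the KKT system (A.89)."""
    n, m = Q.shape[0], A.shape[0]
    K = np.block([[Q, A.T], [A, np.zeros((m, m))]])  # the KKT matrix
    rhs = np.concatenate([-c, b])
    sol = np.linalg.solve(K, rhs)   # unique solution when A has full row rank
    return sol[:n], sol[n:]         # and the reduced Hessian is positive definite

# Example: min 1/2 (x1^2 + x2^2)  s.t.  x1 + x2 = 2  ->  x = (1, 1)
Q = np.eye(2); c = np.zeros(2)
A = np.array([[1.0, 1.0]]); b = np.array([2.0])
x, lam = eq_qp(Q, c, A, b)
print(x)  # [1. 1.]
```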
There are a number of approaches for the case in which the constraints are:

Ax ≤ b, x ≥ 0 (A.90)

The linear equations provided by the derivatives of the Lagrangian can be used to establish a kind of linear programming problem, so the simplex algorithm can be applied in a special way. These equations would be:

Q x + c + Aᵀλ ≥ 0 (A.91)

Ax ≤ b (A.92)
The idea is to use two sets of artificial variables for creating a linear problem.
The first set is added to convert equations into equalities. The second set is added to
obtain a linear objective function.
For instance, consider the following problem [56]:
Minimize: z = −8 x1 − 16 x2 + x1² + 4 x2²
Subject to: x1 + x2 ≤ 5 and x1 ≤ 3
with all variables ≥ 0
After building the Lagrangian and taking derivatives, one obtains:

2 x1 + λ1 + λ2 − e1 = 8

8 x2 + λ1 − e2 = 16

x1 + x2 + μ1 = 5

x1 + μ2 = 3

Notice that a first set of artificial variables e1, e2, μ1, μ2 has been added to obtain the equalities.
Now, more variables are added and the following LP problem is formulated:
Minimize: z = a1 + a2 + a3 + a4
Subject to:

2 x1 + λ1 + λ2 − e1 + a1 = 8

8 x2 + λ1 − e2 + a2 = 16

x1 + x2 + μ1 + a3 = 5

x1 + μ2 + a4 = 3

And then the simplex algorithm is applied taking into account complementary slackness:
xj and ej are complementary
λj and μj are complementary
Recall that during the simplex steps, when exchanging columns, a maximization criterion is applied for choosing the entering variable (column). In the present case (Wolfe's method), the entering variable should not have its complementary variable in the basis (or that variable would leave the basis in the same iteration). Therefore, in each step, some variables must be excluded from the maximization criterion.
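Although Wolfe's method works through simplex tableaus, the optimum of the example above can be cross-checked by brute force. The following Python sketch (the grid resolution is an arbitrary choice) evaluates the objective on a fine feasible grid:

```python
import numpy as np

# Grid check of: min z = -8*x1 - 16*x2 + x1^2 + 4*x2^2
#                s.t. x1 + x2 <= 5, x1 <= 3, x1, x2 >= 0
xs = np.linspace(0.0, 3.0, 301)
ys = np.linspace(0.0, 5.0, 501)
X1, X2 = np.meshgrid(xs, ys)
Z = -8*X1 - 16*X2 + X1**2 + 4*X2**2
Z[X1 + X2 > 5] = np.inf           # discard infeasible points
k = np.argmin(Z)
print(X1.flat[k], X2.flat[k], Z.flat[k])  # near (3, 2), z = -31
```

The grid search locates the constrained minimum at (3, 2) with z = −31, where both constraints are active.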
The history of the simplex method application to this example can be summarized
as follows:
At a feasible point x, equality holds for some of the equations in Ax ≤ b; the set of these equations (constraints) is called the active set, which will be denoted as 𝒜(x).
The active set method iteratively changes the set of constraints that are active, which are contained in a working set W_k. In each step a constraint is added to or removed from W_k, to decrease the objective function.
For each step k, the constraints outside W_k are omitted and then an equality QP is solved. The solution x_W gives us a direction to follow:

p_k = x_W − x⁽ᵏ⁻¹⁾
Therefore, p1 = 0. Since all Lagrange multipliers are negative, this is not the optimum. One of the constraints, (C), is removed, so W1 = {E}. A new equality QP should be solved, whose KKT equations are:

[ 2  0   0 ] [x1]   [ 2 ]
[ 0  2  −1 ] [x2] = [ 5 ]
[ 0  1   0 ] [λ2]   [ 0 ]
Since λ(2) is negative, this is not the optimum. Another constraint is removed, so W2 = ∅. Therefore:

[ 2  0 ] x = [ 2 ] ,   x_W = [  1  ]
[ 0  2 ]     [ 5 ]           [ 2.5 ]
The new search direction would be:

p3 = [  1  ] − [ 1 ] = [  0  ]
     [ 2.5 ]   [ 0 ]   [ 2.5 ]
It is not possible to set α to one, since constraint (A) would not be satisfied. Actually, from:

x1 − 2 x2 + 2 ≥ 0 (A)
[Figure: feasible region with constraints (C), (D), (E); the iterates x⁽⁰⁾ = x⁽¹⁾ = (2, 0) and x⁽²⁾ = (1, 0) are marked on the x1 axis]
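The limiting step along p3 can be computed explicitly. The sketch below (Python; the start point x⁽²⁾ = (1, 0) and constraint (A) are taken from the example) finds the largest step α that keeps constraint (A) satisfied:

```python
import numpy as np

# Step from x(2) = (1, 0) along p3 = (0, 2.5).
# Constraint (A): x1 - 2*x2 + 2 >= 0, written as a.x >= bnd, limits alpha.
x2_pt = np.array([1.0, 0.0])
p3 = np.array([0.0, 2.5])
a, bnd = np.array([1.0, -2.0]), -2.0
# a.(x + alpha*p) >= bnd  ->  alpha <= (bnd - a.x) / (a.p)   (here a.p < 0)
alpha_max = (bnd - a @ x2_pt) / (a @ p3)
x3 = x2_pt + alpha_max * p3
print(alpha_max, x3)  # alpha = 0.6, x3 = (1, 1.5)
```

With α = 0.6 the next iterate is (1, 1.5), where constraint (A) becomes active.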
This section has been introduced because some of the techniques related to integer
programming have connections with methods employed in sparse representation
problems.
Maximize:
z = cᵀx (A.93)
Subject to:
Ax ≤ b, x ≥ 0 (A.94)
or, in standard (equality) form, subject to:
Ax = b, x ≥ 0 (A.96)
For the example at hand, the constraints are:
x1 + x2 ≤ 6
5 x1 + 9 x2 ≤ 45
[Figure: feasible region of the example on the (x1, x2) plane, before and after adding the new cut constraint]
If the solution were integer, that would be all. But no: the solution takes place at the point x* = (2.25, 3.75), so it is not integer.
Let us generate a cut, which is a constraint that is satisfied by all feasible integer solutions, but not by x*. For example:
2 x1 + 3 x2 ≤ 15 (A.97)

In general, given a constraint row:

a1 x1 + a2 x2 + … + an xn = b (A.98)

each coefficient can be split into integer and fractional parts:

[⌊a1⌋ + (a1 − ⌊a1⌋)] x1 + … + [⌊an⌋ + (an − ⌊an⌋)] xn = [⌊b⌋ + (b − ⌊b⌋)] (A.99)
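The integer/fractional decomposition of (A.99) is easy to script. A small Python sketch (the sample row is the x1 row used later in this example):

```python
import math

# Split each coefficient a_i = floor(a_i) + f_i, with 0 <= f_i < 1, as in (A.99)
def gomory_split(coeffs):
    return [(math.floor(a), a - math.floor(a)) for a in coeffs]

# Row x1 + 0.6*s1 - 0.4*s2 = 3.2
print(gomory_split([1.0, 0.6, -0.4, 3.2]))
# floor parts: 1, 0, -1, 3 ; fractional parts near 0.0, 0.6, 0.6, 0.2
```

Note that the fractional part of a negative coefficient such as −0.4 is 0.6, since ⌊−0.4⌋ = −1.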
Therefore, the left-hand side of (A.101) must be an integer; and, since 0 ≤ f < 1, it must be non-negative.
Based on these observations, adequate constraints could be added to the simplex algorithm in order to introduce a cut. For example, suppose you arrived at a non-integer optimum in a certain problem, having the following simplified tableau:
Basis    x1    x2    s1     s2     b
x1        1     0    0.6   −0.4   3.2
x2        0     1   −0.4    0.6   3.2
zj      1.4     1     0      0
cj−zj     0     0   −0.2   −0.2
Select for instance the x1 row. The corresponding constraint would be:

x1 + 0.6 s1 − 0.4 s2 = 3.2 (A.102)

In other terms:

x1 − 3 = 0.2 − 0.6 s1 + 0.4 s2 (A.103)
This constraint is satisfied by every feasible integer solution, but is not feasible for the
current optimal solution (which includes s1 = 0; s2 = 0). By adding this constraint
to the problem, a different optimal solution would be reached, which might be integer.
Maximize:
z = x1 + x2
Subject to:
x2 x1 2
8x1 + 2x2 19
Subject to:
4x1 + 2 x2 15
x1 + 2x2 8
x1 + x2 5
By using the simplex algorithm, one finds the optimum at (2.5, 2.5) with z = 12.
Let us branch on x1. Two sub-problems are obtained:
The first, subject to:
4 x1 + 2 x2 ≤ 15
x1 + 2 x2 ≤ 8
x1 + x2 ≤ 5
x1 ≥ 3
The second, subject to:
4 x1 + 2 x2 ≤ 15
x1 + 2 x2 ≤ 8
x1 + x2 ≤ 5
x1 ≤ 2
Suppose one chooses the first sub-problem. Using the simplex algorithm for this LP, the solution is (3, 1.5), which is not valid. Then, let us branch on x2. Two more sub-problems are obtained:
The third, subject to:
4 x1 + 2 x2 ≤ 15
x1 + 2 x2 ≤ 8
x1 + x2 ≤ 5
x1 ≥ 3
x2 ≥ 2
The fourth, subject to:
4 x1 + 2 x2 ≤ 15
x1 + 2 x2 ≤ 8
x1 + x2 ≤ 5
x1 ≥ 3
x2 ≤ 1
[Figure A.12: branch-and-bound tree. Problem 0 (z = 12) branches into sub-problems 1 and 2; sub-problem 1 branches into 3 (unfeasible) and 4; sub-problem 4 branches into 5 (unfeasible) and 6 (z = 11)]
The third sub-problem is unfeasible. The solution for the fourth sub-problem is
(3.25, 1), not valid. If one takes the fourth sub-problem and branches on x1, two more sub-problems are generated, the 5th with x1 ≥ 4, and the 6th with x1 ≤ 3. The 5th sub-problem is unfeasible, while the 6th gives the solution (3, 1) with z = 11.
This is a lower bound.
Let us come back, and see the second sub-problem. The optimal solution found
with the simplex algorithm is (2, 3) with z = 12. This is the optimum.
Figure A.12 summarizes the process that has been followed in this example.
In general there are two exploration alternatives. In the example, the depth-first alternative has been chosen. The other alternative is breadth-first search. Had we chosen this second way, the search would have been much shorter (just the sub-problems 1 and 2).
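Since the feasible region is small, the whole branch-and-bound outcome can be cross-checked by enumerating integer points. Note that the objective function of this example was lost in the extraction; z = 3 x1 + 2 x2 is assumed below because it reproduces the reported values z = 11 at (3, 1) and z = 12 at (2, 3).

```python
# Exhaustive integer check for the branch-and-bound example.
# The objective z = 3*x1 + 2*x2 is an assumption consistent with the
# reported values z = 11 at (3, 1) and z = 12 at (2, 3).
best = None
for x1 in range(0, 10):
    for x2 in range(0, 10):
        if 4*x1 + 2*x2 <= 15 and x1 + 2*x2 <= 8 and x1 + x2 <= 5:
            z = 3*x1 + 2*x2
            if best is None or z > best[0]:
                best = (z, x1, x2)
print(best)  # (12, 2, 3)
```

The enumeration confirms that (2, 3), the solution of sub-problem 2, is the integer optimum.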
There are several web sites on Integer Programming (see the Resources section). One of the good tutorials available on the Internet is [71]. The most cited papers on this topic are [43, 38]. The article [67] offers a review of fifty years of solution approaches.
In addition to LP, QP and IP problems there are many more types of optimization problems waiting for a solution. This is well known in the industrial and economics worlds. In the last decades, the range of problems that can now be tackled has substantially increased. A large part of this advancement is due to the three fields
What is important in expression (A.105), |f‴(x)| ≤ 2 f″(x)^(3/2), is that the third derivative is bounded at every point by the second derivative.
A main idea of [85] was to replace the inequality constraints by self-concordant barrier terms in the objective function. By using self-concordant functions one can define closeness to the central path and the updating policy so that the Newton steps stay close to the central path, without much fluctuation (this is implied by the derivative bounds). The result is a better convergence of the interior-point method.
All linear and quadratic functions are self-concordant. The sum of two self-concordant functions is also self-concordant. An important example of a self-concordant function is f(t) = −log(t), for which the expression (A.105) holds with equality.
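The equality for f(t) = −log(t) can be verified numerically, using f″(t) = 1/t² and f‴(t) = −2/t³ (a short Python sketch):

```python
# Self-concordance check for f(t) = -log(t):
# f''(t) = 1/t^2, f'''(t) = -2/t^3, so |f'''(t)| = 2 * f''(t)^(3/2) exactly.
for t in [0.1, 1.0, 5.0]:
    f2 = 1.0 / t**2
    f3 = -2.0 / t**3
    print(t, abs(f3), 2.0 * f2**1.5)  # the last two columns coincide
```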
An interesting property of self-concordant functions is that they inform on how close to an optimal point the current algorithm step is, and you can obtain an upper bound on the number of iterations required to approach the optimum within a certain ε.
In recent years it has been realized that better schemes can be obtained, based on barrier functions other than the logarithmic one [7].
Subject to:

fi(x) ≤ 0, i = 1, 2, …, m (A.107)

hj(x) = 0, j = 1, 2, …, r (A.108)

x ∈ S (A.109)
[Figure: examples of a convex set and a non-convex set]
Given a hyperplane aᵀx = β and a convex set S, it is said that the hyperplane supports S at x0 if all points x of S satisfy: x ∈ S ⇒ aᵀx ≤ aᵀx0. The hyperplane is called a supporting hyperplane, and is tangent to S. Figure A.15 depicts an example.
The supporting hyperplane theorem establishes that there exists a supporting
hyperplane at every point of the boundary of a convex set.
The epigraph of a function f(x) is the set of points (x, y) such that y ≥ f(x). Figure A.16 depicts the epigraph of a function f(x); the epigraph is the area above the curve f(x).
Similarly, the hypograph of the function f (x) is the area below the curve.
The function f (x) is a convex function if its epigraph is a convex set.
The convexity of a differentiable function can also be characterized by its gradient and Hessian. Based on the first-order Taylor approximation, a first-order condition can be enounced as follows: f(x) is convex if:

f(y) ≥ f(x) + ∇f(x)ᵀ(y − x), ∀x, y (A.110)

A set K is a cone if:

x ∈ K ⇒ α x ∈ K, ∀α ≥ 0 (A.111)
Figure A.17 shows some examples of 2D cones [118]. Obviously, one of them is
not convex.
The intersection of two cones is a cone.
A set K is a convex cone if K is a cone and K is convex.
Some important examples of convex cones are:
The non-negative orthant:

K = {x | xi ≥ 0, i = 1, 2, …, n} (A.112)

The cone of positive semidefinite matrices:

K = {A ∈ Sⁿ | uᵀA u ≥ 0, ∀u ∈ ℝⁿ} (A.113)

Norm cones:

K = {(x, t) | ‖x‖ ≤ t} (A.114)

An example of norm cone is the second order cone (SOC), which corresponds to ‖x‖ = ‖x‖2 (that is: √(x² + y²) ≤ z in three dimensions). Figure A.18 depicts a second order cone. This cone is also called the Lorentz cone or, more colloquially, the ice-cream cone.
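Second order cone membership is a simple norm test (a Python sketch):

```python
import numpy as np

def in_soc(x, t):
    """Membership test for the second order (Lorentz) cone: ||x||_2 <= t."""
    return np.linalg.norm(x) <= t

print(in_soc(np.array([3.0, 4.0]), 5.0))   # True  (on the boundary)
print(in_soc(np.array([3.0, 4.0]), 4.9))   # False (outside the cone)
```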
Evidently, a particular case of convex programming is the following problem:
Minimize:

f0(x) (A.115)

Subject to:

fi(x) ≤ 0, i = 1, 2, …, m (A.116)

hj(x) = 0, j = 1, 2, …, r (A.117)

x ∈ K (A.118)
Inf: cᵀx
Subject to: Ax = b, x ∈ K
If K is a solid, pointed, closed convex cone, the dual problem would be:
Sup: bᵀy
Subject to: Aᵀy + s = c, s ∈ K*
The dual of the dual problem is equivalent to the primal problem.
Clearly, it is advantageous to deal with scenarios based on self-dual cones.
Once the conic programming fundamentals have been introduced, it seems opportune to complete the view and add some observations.
It can be shown that any convex program can be written as a conic program [36]. So conic programming is important, and it also provides a convenient framework from the mathematical point of view.
In geometry courses, an important lesson is the one devoted to the different curves generated by the intersection of a plane and a cone: the conic sections. In the case of conic programming this is extended to n dimensions. The feasible region can be the intersection of polyhedra, ellipsoids, paraboloids, and hyperboloids. Quadratic programming can be easily treated by adding a new variable and considering the quadratic objective function as included in the set of constraints.
There are some peculiarities to keep in mind. For instance, the problem:

min{x3 | x1 = 1, x2 = 1, x3 ≥ √(x1² + x2²)} (A.120)

has an irrational solution: x3 = √2.
Another case:

inf{x3 − x2 | x1 = 1, x3 ≥ √(x1² + x2²)} (A.121)

Here the infimum (which is 0) is not attained at any feasible point.
A symmetric matrix A is positive semidefinite if:
For all x ∈ ℝⁿ, xᵀA x ≥ 0
All eigenvalues of A are non-negative.
The set of positive semidefinite matrices is denoted as S₊ⁿ.
A matrix P is positive definite if for all x ∈ ℝⁿ, xᵀP x > 0, x ≠ 0.
The set of positive definite matrices is denoted as S₊₊ⁿ.
All eigenvalues of a symmetric matrix are real. The corresponding eigenvectors can be chosen so that they are orthogonal. The determinant of the matrix is the product of the eigenvalues; and the sum of the diagonal entries (the trace of the matrix) is equal to the sum of the eigenvalues.
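These eigenvalue facts are easy to confirm numerically (a Python/NumPy sketch with an arbitrary symmetric matrix):

```python
import numpy as np

# For a symmetric matrix: real eigenvalues, orthogonal eigenvectors,
# det = product of eigenvalues, trace = sum of eigenvalues.
A = np.array([[2.0, 1.0], [1.0, 3.0]])
w, V = np.linalg.eigh(A)
print(np.allclose(np.prod(w), np.linalg.det(A)))   # True
print(np.allclose(np.sum(w), np.trace(A)))         # True
print(np.allclose(V.T @ V, np.eye(2)))             # True: orthonormal eigenvectors
print(np.all(w > 0))                               # True: A is positive definite
```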
Subject to:

F(x) ⪰ 0 (A.123)

where:

F(x) = F0 + Σ(i=1..m) xi Fi (A.124)
In this way, an SDP formulation has been obtained. Notice that the matrix has a block-diagonal form. The matrix inequality:

[  t    cᵀx ]
[ cᵀx   dᵀx ] ⪰ 0   (A.125)

is equivalent to dᵀx ≥ 0 and t − (cᵀx)²/(dᵀx) ≥ 0 (a Schur complement argument), so this type of fractional constraint can be expressed as the matrix inequality (A.125).
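The Schur complement equivalence can be checked numerically for scalar data (a Python sketch; the numbers are arbitrary):

```python
import numpy as np

def lmi_psd(t, cx, dx):
    """Check [[t, cx], [cx, dx]] >= 0 via eigenvalues (the LMI (A.125))."""
    return bool(np.all(np.linalg.eigvalsh(np.array([[t, cx], [cx, dx]])) >= -1e-12))

# Schur complement: for dx > 0 the LMI holds iff t - cx**2 / dx >= 0
t, cx, dx = 2.0, 1.0, 1.0
print(lmi_psd(t, cx, dx), t - cx**2 / dx >= 0)        # True True
print(lmi_psd(0.5, cx, dx), 0.5 - cx**2 / dx >= 0)    # False False
```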
Subject to:

tr(Fi Z) = ci, i = 1, 2, …, m (A.127)

Z ⪰ 0 (A.128)

where tr(·) is the trace of the matrix, and Z = Zᵀ is the variable added for the Lagrangian.
The duality gap would be:

η = cᵀx + tr(F0 Z) (A.129)

An alternative notation uses the matrix inner product:

⟨C, X⟩ = Σ(i=1..n) Σ(j=1..n) cij xij (A.130)

With it, the primal SDP problem is to minimize ⟨C, X⟩ subject to ⟨Ai, X⟩ = bi, i = 1, 2, …, m, and:

X ⪰ 0 (A.133)
The SDP problem is a conic programming case, since the set of positive semidef-
inite matrices X is a cone.
The variable X is:

    [ x11 x12 x13 ]
X = [ x21 x22 x23 ]
    [ x31 x32 x33 ]

Therefore:

⟨C, X⟩ = x11 + 2x12 + 3x13 + 2x21 + 9x22 + 0x23 + 3x31 + 0x32 + 7x33
All these polynomials can be further simplified taking into account that X is sym-
metric.
The dual problem of the above primal SDP problem would be:
Maximize:

Σ(i=1..m) yi bi (A.134)

Subject to:

Σ(i=1..m) yi Ai + S = C (A.135)

S ⪰ 0 (A.136)
Subject to:

     [ 1 0 1 ]      [ 0 2 8 ]       [ 1 2 3 ]
y1 · [ 0 3 7 ] + y2·[ 2 6 0 ] + S = [ 2 9 0 ]   (A.138)
     [ 1 7 5 ]      [ 8 0 4 ]       [ 3 0 7 ]

S ⪰ 0 (A.139)
Given a feasible solution X of the primal and a feasible solution (y, S) of the dual, the duality gap is:

⟨C, X⟩ − Σ(i=1..m) yi bi = ⟨S, X⟩ ≥ 0 (A.141)
If the duality gap is zero, then the solutions X and (y, S) are optimal solutions of the primal and the dual problem, respectively.
Subject to:

C − Σ(i=1..m) yi Ai ⪰ 0 (A.143)
and it happens that this is also a positive semidefinite problem, similar to:
Minimize:

cᵀx (A.144)

Subject to:

F(x) ⪰ 0 (A.145)

with:

F(x) = F0 + Σ(i=1..m) xi Fi (A.146)
In most SDP problems the optimum of the primal problem is coincident with the
optimum of the dual problem, so the duality gap is zero. A combined view of the
optimization problem could be to consider:
Minimize:

ψ = cᵀx + tr(F0 Z) (A.147)

Subject to:

F(x) ⪰ 0, Z ⪰ 0 (A.148)

tr(Fi Z) = ci, i = 1, 2, …, m (A.149)
This form, which intends to minimize the duality gap ψ, is called the primal-dual optimization approach.
∇φ(x) = − Σ(i=1..n) (1/(aiᵀx + bi)) ai = − Σ(i=1..n) (1/ri) ai (A.154)

where ri = aiᵀx + bi is the residual (slack) of the i-th constraint.
tr(F(x_A)⁻¹ Fi) = 0, i = 1, 2, …, m (A.156)
So the analytic center is the feasible point that maximizes the product of distances to
the constraint hyperplanes. It is also the equilibrium point of the repulsive forces.
Newton's method can be used to compute the analytic center of a given set of constraints, departing from an initial strictly feasible point (a point such that F(x) > 0).
The Newton direction would be [111]:

δx = arg min_δ ‖ I + Σ(i=1..m) δi F⁻¹ᐟ² Fi F⁻¹ᐟ² ‖_F (A.158)

where ‖·‖_F is the Frobenius norm:

‖A‖_F = (tr(AᵀA))¹ᐟ² = (Σij Aij²)¹ᐟ²
The iterative procedure applies this Newton step repeatedly. Now, add to the constraints (A.159) the extra equality:

cᵀx = γ (A.160)
Figure A.22 depicts the situation. The line cᵀx = p̂ crosses the point O, and the line cᵀx = p* crosses the point P (the optimal).
The analytic center corresponding to (A.159) and (A.160) would be denoted as x_A(γ).
The curve described by x_A(γ), for γ ranging from p̂ to p*, is the central path, which is the curve (dash-point-dash) that in the figure goes from O to P.
The solution of (A.161) satisfies the central-path conditions.
[Figure: nesting of problem classes, LP ⊂ QP ⊂ SOCP ⊂ SDP]
It would also be recommended to visit the web site of El Ghaoui (see the resources
section).
A.6.4 Duality
It is usual, when using interior-point methods, to deal with dual problems. This is one of the reasons that have motivated much interest in duality issues. In this context, it is illuminating to consider Fenchel's duality theorem, which we introduce now by means of two simple figures.
The scenario consists of a convex and a concave curve. The problem is to find the minimum vertical distance between the two curves. Figure A.23 depicts the case.
A second problem is to find two separated parallel tangents, so that their distance is maximal. Figure A.24 depicts this other case.
Fenchel's duality theorem establishes that the points having the minimal vertical separation are also the tangency points for the maximally separated parallel tangents.
Let us devote some effort to formalizing this result. Of course, what makes it interesting for us are the primal-dual solutions and duality gap issues.
310 Appendix A: Selected Topics of Mathematical Optimization
The reader may suspect that mathematical difficulties may arise when functions are not convex and/or non-differentiable. Of course, it is not our mission here to attempt a complete study of this question; but it is convenient to include some hints.
For instance, Fig. A.26 shows how the intersection of two differentiable functions may give a non-differentiable (non-smooth) function.
Given a function f(x), it is said that it admits a supporting line at x if there exists a parameter p such that:

f(y) ≥ f(x) + p (y − x), ∀y
Figure A.27 depicts a curve with a number of interesting points. Two supporting
lines have been plotted (dash-point-dash lines).
Let us make the following remarks:
For the general cases, it can be shown that if f(x) admits a supporting line at x with slope k, then f*(k) admits a supporting line at k with slope x. Moreover, if f(x) admits a strictly supporting line at x with slope k, then f*(k) admits a tangent line at k with slope f*′(k) = x (therefore, f*(k) is differentiable at k).
It can also be shown that f**(x) = f(x) iff f(x) admits a supporting line at x. Also, if f*(k) is differentiable at k, then f**(x) = f(x) at x = f*′(k). Indeed, if f*(k) is everywhere differentiable, then f**(x) = f(x) for all x.
An important property of the LF transform is that f*(k) and f**(x) are convex functions (they are U-shaped).
Two interesting properties:
A convex function can always be written as the LF transform of another function.
f**(x) is the largest convex function that satisfies: f**(x) ≤ f(x).
And finally, concerning non-differentiable points, suppose that f (x) has one of
these points (like the one represented in Fig. A.26) and denote it as xC . This point
admits infinitely many supporting lines, with slopes in the range k1 . . . k2 . The cor-
responding f (k) would be a line of constant slope xC in the interval k1 . . . k2 .
Figure A.28 depicts an example.
See [108] for more details on the LF transform.
Let us express Fenchel's duality theorem using the formalism already introduced. Suppose that f(x) is a proper convex function, g(x) is a proper concave function, and regularity conditions are satisfied. Then:

inf_x ( f(x) − g(x) ) = sup_a ( g*(a) − f*(a) )

where:

f*(a) = sup_x ( aᵀx − f(x) ) (A.169)

The function f*(a) is called the convex conjugate of f(x); and the function g*(a) is called the concave conjugate of g(x).
If f(x) is convex and differentiable, one can consider the Legendre transform:

d/dx ( a x − f(x) ) = a − df(x)/dx = 0 (A.174)

The usual procedure is to write:

a = df(x)/dx (A.175)

and to obtain from this equation x as a function of a. Then:

f*(a) = a x(a) − f(x(a))

Another definition of the Legendre transform says that f(x) and f*(a) are Legendre transforms of each other if D(f(x)) = (D(f*(a)))⁻¹ (where D means derivative). Actually, from the definition it can be derived that:

a = df(x)/dx ;  x = df*(a)/da (A.177)
These expressions recall Hamiltonian mechanics and other main topics of Physics.
Notice that, based on (A.175), in the n-dimensional case one can write:

a = ∇f(x)

In economic terms, f*(a) can represent the gain, a the price, x the level of production, and f(x) the cost of this production. It may happen that the price depends on x, so our formula should be modified. This is in accordance with the initiative of Moreau, who extended Fenchel's conjugation using coupling functions c(x, a):

f^c(a) = sup_x ( c(x, a) − f(x) )   (Moreau's conjugate)
Appendix A: Selected Topics of Mathematical Optimization 315
A.6.4.4 Saddle-Value
Consider a function K(x, y). The variables x and y belong to certain sets X and Y. Two optimization problems were considered:
minimize f(x) over all x ∈ X
maximize g(y) over all y ∈ Y
Clearly:

f(x) ≥ K(x, y) ≥ g(y) (A.182)

And then:

inf_x f(x) ≥ sup_y g(y) (A.183)

If equality holds, the common value is called the saddle-value of K, which exists if there is a saddle-point of K. A point (xs, ys) is a saddle-point of K if:

K(xs, y) ≤ K(xs, ys) ≤ K(x, ys), ∀x ∈ X, y ∈ Y
A rapid overview of conjugate function examples may help for an evaluation of the usefulness of the duality approach.
Let us start with the indicator function and norms. The conjugate of a norm f(x) = ‖x‖_p is the indicator function of the unit ball of the dual norm ‖·‖_q,

with: 1/p + 1/q = 1

[Figure: f(x) = |x| and its conjugate f*(a), the indicator of the interval −1 ≤ a ≤ 1]
Now, let us devote some space to curves.
Suppose that f(x) is a parabola: f(x) = x². Its conjugate is: f*(a) = (1/4) a². Figure A.30 visualizes this case.
For a quadratic form:

f(x) = (1/2) xᵀA x (A.187)

the conjugate is:

f*(a) = (1/2) aᵀA⁻¹ a (A.188)

A more general case could be:

f(x) = (1/2) xᵀA x + cᵀx + b (A.189)

the conjugate is:

f*(a) = (1/2) (a − c)ᵀ A⁻¹ (a − c) − b (A.190)
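The conjugate formulas can be verified by approximating the supremum on a grid. The scalar Python sketch below checks f*(a) = (a − c)²/(2A) − b for the quadratic (A.189); the coefficients are arbitrary choices:

```python
import numpy as np

# Numeric check of the conjugate of f(x) = 1/2*A*x^2 + c*x + b (scalar case):
# sup_x (a*x - f(x)) is attained at x = (a - c)/A, giving (a - c)^2/(2A) - b.
A_, c_, b_ = 2.0, 1.0, 3.0
f = lambda x: 0.5 * A_ * x**2 + c_ * x + b_
xs = np.linspace(-10.0, 10.0, 200001)
for a in [-1.0, 0.0, 2.5]:
    numeric = np.max(a * xs - f(xs))          # grid approximation of the sup
    closed = (a - c_)**2 / (2.0 * A_) - b_    # closed-form conjugate
    print(abs(numeric - closed) < 1e-6)       # True
```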
The conjugate of the exponential function f(x) = eˣ is:

f*(a) = a log a − a, for a > 0

Moreover, if a (primal) constraint qualification is satisfied, such as 0 ∈ int(dom g − A dom f), then the primal and dual optimal values coincide. The unique solution x_M of the primal problem can be derived from a (non-necessarily unique) solution y_M of the dual problem as:

x_M = ∇f*(Aᵀ y_M) (A.199)

In [77] this duality was further extended to the context of saddle functions and dual mini-max problems.
As the reader surely has noticed, some of the previous expressions involve a chaining of min (or inf) and max (or sup). Actually this is connected with mini-max results, which in turn are representative of game theory [115].
Let us select some theorems.
The first (weak duality) says that:

sup_y inf_x g(x, y) ≤ inf_x sup_y g(x, y)

The conditions can be relaxed a bit, so a third theorem (strong duality), due to Sion in 1958, establishes that given a g(x, y), lower semi-continuous quasi-convex on x ∈ X and upper semi-continuous quasi-concave on y ∈ Y, with X and Y convex and one of them compact, then:

sup_y inf_x g(x, y) = inf_x sup_y g(x, y)
There are several theorems and mathematical results in connection with the mini-
max context. For instance, [39] includes up to 14 mini-max theorems, showing that
they form an equivalent chain.
In a very enjoyable article [26], the authors describe a historic encounter between von Neumann and Dantzig in 1947, at Princeton. Those were the times when it was conjectured that close relationships existed between game theory, duality, and linear programming. It is now well known, see for instance [1], that any zero-sum game can be reduced to a linear programming problem (and this can be used to prove the mini-max theorem based on strong duality), and vice-versa: a linear programming problem can be reduced to a zero-sum game.
In order to give some more details on this aspect, we could introduce a couple of
rapid examples.
A standard problem is the prisoner's dilemma, which can be represented with the following tables (Figs. A.31 and A.32).
In each cell, the payoff of each prisoner is given in terms of years reduction.
The sum of payoffs in each cell of a zero-sum game is zero. That means, in a
two-person game, that if one player wins 100, then the other player loses 100. This
type of games can be represented by just one payoff matrix (the other having term by term the opposite sign). Take for example the case of a negotiation between a football player and his employer:

                 A        B
Player   1     35000    40000
(row)    2     30000    20000
Usually, zero-sum tables put on the left the agent that wants to maximize the
outcome (choosing between strategies 1 or 2), and on top the agent that wants to
minimize it (selecting strategy A or B). If the player chooses 1, the employer will
choose A. If the player chooses 2, the employer will choose B. The best for the athlete
would be 1-A. In mathematical terms:

max_i min_j p(i, j) = min_j max_i p(i, j) = 35,000

And so, a mini-max result has been found. The entry 1-A, with the value 35,000, is a saddle-point of the game. Because of the saddle-point, neither player can take advantage of the rival's strategy; it is said that one has a stable solution (or equilibrium solution).
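The saddle-point computation is a two-line check (Python/NumPy):

```python
import numpy as np

# Payoff matrix of the player/employer game (rows maximize, columns minimize)
P = np.array([[35000, 40000],
              [30000, 20000]])
maximin = np.max(np.min(P, axis=1))  # best guaranteed outcome, row player
minimax = np.min(np.max(P, axis=0))  # best guaranteed outcome, column player
print(maximin, minimax)  # 35000 35000 -> saddle point at entry 1-A
```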
Some games do not possess a saddle-point. In such a case, players should avoid a predictable strategy, which may suppose an advantage for the opponent. Hence, each player should choose at random among their alternatives, according to some probability distributions x = (x1, x2, …, xn), for player 1, and y = (y1, y2, …, ym), for player 2. These are called mixed-strategy games. The equivalence between games and linear programming was originally established in this context [26]. In particular, given the LP problem:
given the LP problem:
min cᵀx subject to Ax ≥ b, x ≥ 0
and the dual:
max bᵀy subject to Aᵀy ≤ c, y ≥ 0
Dantzig suggested reducing this pair of LP problems to a symmetric zero-sum game by using the following payoff matrix:

    [  0    A   −b ]
P = [ −Aᵀ   0    c ]   (A.205)
    [  bᵀ  −cᵀ   0 ]
The Farkas lemma can be expressed in several ways [55, 99]. In geometrical terms, it says that a vector is either in a given convex cone, or there is a hyperplane separating the vector from the cone. The lemma can be regarded as a particular case of the separating hyperplane theorem.
Another formulation is the following: let A be an m × n matrix and b a column vector of size m. Only one of the next alternatives holds:
There exists x ∈ ℝⁿ such that A x = b and x ≥ 0
There exists y ∈ ℝᵐ such that Aᵀy ≥ 0 and bᵀy < 0
Note that this second alternative is equivalent to (replace y with −y):
There exists y ∈ ℝᵐ such that Aᵀy ≤ 0 and bᵀy > 0
An important application of the lemma is for the analysis of Linear Programming
problems, concerning in particular feasibility aspects.
For example, in the case of the LP problem:
min cᵀx subject to Ax ≥ b, x ≥ 0
and the dual:
max bᵀy subject to Aᵀy ≤ c, y ≥ 0
the Farkas lemma implies that exactly one of the following cases occurs:
Both the primal and the dual have optimal solutions x* and y* with equal values cᵀx* = bᵀy*
The dual is unfeasible, and the primal is unbounded (cᵀx* → −∞)
The primal is unfeasible, and the dual is unbounded (bᵀy* → ∞)
Both primal and dual are unfeasible
The lemma can be used as a certificate of infeasibility [6]. In particular, the primal problem:
min cᵀx subject to Ax ≥ b, x ≥ 0
is unfeasible if and only if there exists y ≥ 0 such that Aᵀy ≤ 0 and bᵀy > 0.
The use of gradients is important in signal processing, in particular for iterative adap-
tation and/or optimization schemes. This is evident from the widespread presence of
gradients in the chapters of this book.
The iterative gradient descent method can be summarized with a simple equation:

x(k+1) = x(k) − α g(k)

where g(k) = ∇f(x(k)).
As asserted by [11, 48], gradient-based methods are attracting renewed interest, since they can be competitive for large-scale optimization problems [25]. A main effort nowadays is devoted to obtaining first-order methods able to capture curvature information. First-order methods only use derivatives, while second-order methods use the Hessian (which effectively obtains curvature information, at the price of higher complexity).
This subsection focuses on first order methods, and the next subsection extends
the view to Newton related methods.
A natural analogy can be established between steepest ascent and hill climbing, as already mentioned in this book. The basic idea is to choose in each step the direction of largest ascent. However, this may drive you to a stationary point which is not the highest peak, but a local maximum instead. It is pertinent to determine, as far as possible, whether a function to be optimized offers good opportunities for gradients or not. Some ideas for this come next.
A function f(x) satisfies the Lipschitz condition if there exists a constant L > 0 such that:

| f(x) − f(y) | ≤ L | x − y |, ∀x, y

This expression says that the slope of the line connecting the points f(x) and f(y) is no greater than L.
Functions satisfying this condition are called Lipschitz functions; also, colloquially, one could say that a function f(x) is Lipschitz.
The Lipschitz condition is stronger than continuity. Indeed, Lipschitz functions are continuous functions.
The function |x| is Lipschitz, but not differentiable. The function √x is a continuous function but it is not a Lipschitz function, because it becomes infinitely steep as x → 0. Likewise, the function x² is a continuous function but is not Lipschitz, since it becomes arbitrarily steep as x → ∞. See [101] for more examples.
In the case of differentiable functions, gradients are defined and so they can be an object of study. The gradient of a function is Lipschitz if a constant L > 0 exists with:

‖∇f(x) − ∇f(y)‖2 ≤ L ‖x − y‖2, ∀x, y ∈ ℝⁿ (A.209)

On the basis of the Cauchy-Schwarz inequality, it can be shown (lemma) that if the gradient is Lipschitz then:

| f(y) − f(x) − ∇f(x)ᵀ(y − x) | ≤ (L/2) ‖y − x‖2² (A.210)

Put another way:

f(y) ≤ f(x) + ∇f(x)ᵀ(y − x) + (L/2) ‖y − x‖2² (A.211)

A consequence of this lemma is that:

(L/2) ‖x‖2² − f(x) (A.212)

is convex.
For twice differentiable f(x), if the gradient is Lipschitz: ∇²f(x) ⪯ L·I.
Figure A.33 depicts the upper bound in case the gradient of f(x) is Lipschitz.
Now, for lower bounds, one considers some more definitions related to convexity.
Figure A.33 depicts the upper bound in case the gradient of f (x) is Lipschitz.
Now, for lower bounds one considers some more definitions related to convexity.
The function f(x) is convex if:

f(αx + (1 − α)y) ≤ α f(x) + (1 − α) f(y), ∀x, y; 0 ≤ α ≤ 1 (A.213)

The function f(x) is strictly convex if the equality holds only when x = y. The function f(x) is strongly convex, with modulus m > 0, if:

f(αx + (1 − α)y) ≤ α f(x) + (1 − α) f(y) − (m/2) α(1 − α) ‖x − y‖2², ∀x, y; 0 ≤ α ≤ 1 (A.215)
Again, if f(x) is differentiable it is possible to consider gradients, obtaining some more inequalities corresponding to convexity. In particular, f(x) is strongly convex, with modulus m, if:

f(y) ≥ f(x) + ∇f(x)ᵀ(y − x) + (m/2) ‖y − x‖2², ∀x, y (A.218)

For twice differentiable f(x), the function is convex iff ∇²f(x) ⪰ 0, and strongly convex iff ∇²f(x) ⪰ m·I. The condition ∇²f(x) ≻ 0 is sufficient for f(x) to be strictly convex, but it is not a necessary condition (for example, f(x) = x⁴).
Figure A.34 depicts the lower bound in case of f (x) being differentiable and
strongly convex.
Notice that in the case of a differentiable and strongly convex f (x) with the
gradient being Lipschitz, one has the following sandwich:
C(x, y) + (m/2) ‖y − x‖2² ≤ f(y) ≤ C(x, y) + (L/2) ‖y − x‖2² (A.219)

where:

C(x, y) = f(x) + ∇f(x)ᵀ(y − x) (A.220)
(∇f(y) − ∇f(x))ᵀ(y − x) ≥ (1/L) ‖∇f(y) − ∇f(x)‖2², ∀x, y (A.225)

and so, the gradient ∇f(x) is co-coercive with modulus 1/L.
Strong convexity and Lipschitz gradient are related by Fenchel duality. According to Lemma 5.10 in [114]:
1. If f(·) is strongly convex with modulus m, the conjugate f*(·) has Lipschitz gradient with L = 1/m.
2. If f(·) is convex and has Lipschitz gradient, the conjugate f*(·) is strongly convex with modulus m = 1/L.
326 Appendix A: Selected Topics of Mathematical Optimization
There are several ways of choosing the step size. For instance, one could try to
determine the size that obtains the maximum descent; this is a line-search optimization
problem that must be solved at each step of the iteration. However, it has been
recognized that this method is not the most convenient [121].
One could sometimes opt for a fixed step size $\alpha$. In this case, if f(x) is differentiable
and convex, and with Lipschitz gradient, then the gradient descent algorithm with
$\alpha < 2/L$ will converge to a stationary point. Using the lemma (A.211), the
convergence can be characterized as:

$f(x_k) - f(x^*) \;\leq\; \frac{\|x_0 - x^*\|_2^2}{2 \alpha k}$   (A.227)
A practical way of using Armijo's rule is backtracking line search. It is a
simple method that adapts the size of each step as follows:

  start with t = 1
  while $f(x_k - t \nabla f(x_k)) > f(x_k) - \frac{t}{2} \|\nabla f(x_k)\|_2^2$, update $t = \beta t$

where $0 < \beta < 1$ is a parameter you fix beforehand.
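The rule above can be sketched as a small NumPy routine (the quadratic test problem, the function name and the parameter values are illustrative assumptions, not the book's own listing):

```python
import numpy as np

def gradient_descent_backtracking(f, grad, x0, beta=0.5, tol=1e-8, max_iter=500):
    """Gradient descent with backtracking line search: shrink t by beta
    until f(x - t*g) <= f(x) - (t/2)*||g||^2, then take the step."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        t = 1.0
        while f(x - t * g) > f(x) - 0.5 * t * (g @ g):
            t *= beta                      # backtrack
        x = x - t * g
    return x

# Example: minimize f(x) = 0.5 x'Qx - c'x, whose minimum is Q^{-1} c
Q = np.array([[3.0, 0.0], [0.0, 1.0]])
c = np.array([3.0, 1.0])
xmin = gradient_descent_backtracking(
    lambda x: 0.5 * x @ Q @ x - c @ x,
    lambda x: Q @ x - c,
    x0=np.zeros(2))
print(xmin)   # close to [1, 1]
```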
Like in the case of fixed step size, if f(x) is differentiable and convex, and with
Lipschitz gradient, then the convergence would be:

$f(x_k) - f(x^*) \;\leq\; \frac{\|x_0 - x^*\|_2^2}{2 \, t_{min} \, k}$   (A.230)

where $t_{min} = \min(1, \beta/L)$ is the smallest step size the backtracking loop can
produce (recall that the parameter $0 < \beta < 1$ has to be specified beforehand).
A more detailed account of line searching methods can be found in [50].
In order to gain insight into gradient descent issues, it is opportune to study
the quadratic function case, for several reasons.
One important reason is that, according to the matrix form of Taylor's theorem,
one could devise the following approximation:

$f(x_{k+1}) \approx f(x_k) + \nabla f(x_k)^T (x_{k+1} - x_k) + \frac{1}{2} (x_{k+1} - x_k)^T H \,(x_{k+1} - x_k)$   (A.233)

(H, the Hessian, can also be written as $\nabla^2 f(x)$).
This expression shows the interest of quadratic approximations for the study of
gradient descent behaviour.
It is also opportune to mention that for a twice differentiable and strongly convex
f(x) with Lipschitz gradient, there is another sandwich:

$m I \preceq \nabla^2 f(x) \preceq L I$   (A.234)

The ratio $\kappa = L/m$ is then an upper bound on the condition number of the matrix
$\nabla^2 f(x)$ (the condition number is the ratio of the largest eigenvalue to the smallest
eigenvalue).
For the purposes of gradient descent convergence, it is good to have $\kappa$ close to 1
(the best case is $\kappa = 1$).
By the way, it is said that an algorithm has linear convergence if its iterations
satisfy:

$\frac{f(x_{k+1}) - f(x^*)}{f(x_k) - f(x^*)} \;\leq\; \gamma$   (A.235)

where the constant $\gamma < 1$ is the convergence constant (it is good to have a small $\gamma$).
Now, here is the quadratic function to be studied:

$f(x) = \frac{1}{2} x^T Q x - c^T x$   (A.236)

where Q is a positive definite symmetric matrix, so all its eigenvalues are > 0.
Therefore, f(x) is strongly convex.
The minimum of f(x) is attained at:

$x^* = Q^{-1} c$   (A.237)

The steepest descent direction at $x_k$ is:

$d = -\nabla f(x_k) = -(Q x_k - c)$   (A.239)
On the basis of this equation, it is possible to determine the optimal step size, which
is:

$\alpha = \frac{d^T d}{d^T Q d}$   (A.241)

With this step size, the decrease per iteration satisfies:

$\frac{f(x_{k+1}) - f(x^*)}{f(x_k) - f(x^*)} = 1 - \frac{1}{\beta}$   (A.242)

with:

$\beta = \frac{(d^T Q d)(d^T Q^{-1} d)}{(d^T d)^2}$   (A.243)
One can then try different eigenvalue choices for Q and study the evolution of the
gradient descent.
Figure A.36 depicts a typical path of the gradient descent. Notice that each step is
orthogonal to the previous one. Depending on the initial state, the directions of steps may
point more or less precisely to the origin (which is the optimum). When the condition
number increases the contours of the quadratic function become more elongated, and
the zig-zags become more pronounced.
A.7.1.4 Acceleration
According to the heavy-ball metaphor, the state $x_k$ is like a point mass with
momentum, so it has a tendency to continue moving in the direction $x_k - x_{k-1}$. The
heavy-ball iteration is $x_{k+1} = x_k - \alpha \nabla f(x_k) + \beta (x_k - x_{k-1})$, with the
constants chosen as follows:

$\alpha = \frac{4}{L} \cdot \frac{1}{(1 + 1/\sqrt{\kappa})^2}; \quad \beta = \left(1 - \frac{2}{1 + \sqrt{\kappa}}\right)^2$   (A.248)
Compared with steepest descent, it needs $q = \sqrt{\kappa}$ times fewer steps (for instance,
if $\kappa = 144$, then q = 12 times fewer steps).
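A minimal sketch of the heavy-ball iteration with the constants (A.248) follows (illustrative NumPy code; the test quadratic and the iteration count are assumptions):

```python
import numpy as np

def heavy_ball(grad, L, kappa, x0, iters=300):
    """Heavy-ball (momentum) method with the constants of (A.248):
    alpha = (4/L)/(1 + 1/sqrt(kappa))^2,  beta = (1 - 2/(1 + sqrt(kappa)))^2."""
    alpha = (4.0 / L) / (1.0 + 1.0 / np.sqrt(kappa)) ** 2
    beta = (1.0 - 2.0 / (1.0 + np.sqrt(kappa))) ** 2
    x_prev = x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        # gradient step plus momentum in the direction x - x_prev
        x_prev, x = x, x - alpha * grad(x) + beta * (x - x_prev)
    return x

# Example: quadratic with m = 1 and L = 100, so kappa = 100
Q = np.diag([1.0, 100.0]); c = np.array([1.0, 1.0])
x_hb = heavy_ball(lambda x: Q @ x - c, L=100.0, kappa=100.0, x0=np.zeros(2))
print(x_hb)   # close to the minimizer Q^{-1} c = [1, 0.01]
```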
A popular method is the conjugate gradient descent. It can be represented as
follows:

$x_{k+1} = x_k + \alpha_k \, p_k$   (A.249)

The parameter $\alpha_k$ can be chosen in several ways, for example as the minimizer
of f(x) along $p_k$. The direction $p_k$ combines the current gradient with the previous
direction, $p_k = -\nabla f(x_k) + \beta_k \, p_{k-1}$; one alternative for the coefficient is:

$\beta_k = \frac{\|\nabla f(x_k)\|_2^2}{\|\nabla f(x_{k-1})\|_2^2}$   (A.251)
Nesterov proposed an accelerated scheme in which the gradient step is taken from an
extrapolated point: $x_{k+1} = y_k - \alpha \nabla f(y_k)$, $y_{k+1} = x_{k+1} + w_k (x_{k+1} - x_k)$, with:

$w_k = \frac{t_k - 1}{t_{k+1}}; \quad t_{k+1} = \frac{1 + \sqrt{1 + 4 t_k^2}}{2}; \quad \alpha = \frac{1}{L}$   (A.254)

With this method, for differentiable convex f(x) having Lipschitz gradient, Nesterov
achieves:

$f(x_k) - f(x^*) \;\leq\; \frac{4 L \|x_0 - x^*\|_2^2}{(k + 2)^2}$   (A.255)
(For strongly convex functions, rates are often expressed in terms of the ratio Q = L/m.)
In 1988 Barzilai and Borwein [9] introduced a gradient method whose step size is
computed from the two most recent iterates, $\alpha_k = \frac{s_k^T s_k}{s_k^T z_k}$, with:

$s_k = x_k - x_{k-1}; \quad z_k = \nabla f(x_k) - \nabla f(x_{k-1})$   (A.259)
Frequently this method is cited as the BB method. According to [122], the BB
method is in practice much better than the standard steepest descent. A more recent
paper [92] contributes an improved version of BB.
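A sketch of the BB iteration follows (illustrative NumPy code; the initial step size and the test problem are assumptions):

```python
import numpy as np

def bb_gradient(grad, x0, alpha0=1e-3, iters=100):
    """Barzilai-Borwein gradient method: the step size is computed from
    the two most recent iterates, alpha_k = (s_k's_k)/(s_k'z_k), with
    s_k and z_k as in (A.259). No line search is needed."""
    x_prev = np.asarray(x0, dtype=float)
    g_prev = grad(x_prev)
    x = x_prev - alpha0 * g_prev              # small initial plain step
    for _ in range(iters):
        g = grad(x)
        if np.linalg.norm(g) < 1e-12:         # already at the optimum
            break
        s, z = x - x_prev, g - g_prev         # differences (A.259)
        alpha = (s @ s) / (s @ z)             # BB step size
        x_prev, g_prev = x, g
        x = x - alpha * g
    return x

Q = np.diag([1.0, 50.0]); c = np.array([1.0, 1.0])
x_bb = bb_gradient(lambda x: Q @ x - c, x0=np.array([4.0, -3.0]))
print(x_bb)   # close to the minimizer Q^{-1} c = [1, 0.02]
```

Note that the iterates are not monotone in f, yet for this ill-conditioned quadratic the method reaches the minimizer far faster than plain steepest descent.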
See [87] for more details on the accelerated gradient methods.
As seen in the previous sub-section, gradient-based iterative methods for finding the
optimum of a function proceed along a series of points, step by step. Typically these
methods first choose a step direction and then a step size. On the contrary, trust region
(TR) methods first choose a step size, estimating a certain TR, and then choose a
step direction. According to the review [120] of TR algorithms, they constitute a
class of relatively new algorithms.
Before coming to details of TR algorithms, it seems opportune to include a sum-
mary of some traditional methods.
The basic idea of line search methods is, for a given point $x_k$ and a direction $p_k$, to
minimize over the step size $\alpha$:

$f(x_k + \alpha \, p_k)$   (A.260)
As it was already introduced, Newton's method is based on a second order Taylor
expansion:

$f(x_k + p_k) \approx f(x_k) + \nabla f(x_k)^T p_k + \frac{1}{2} p_k^T H_f(x_k) \, p_k$   (A.262)

(where $H_f(\cdot)$ is the Hessian).
To minimize it, the differentiation with respect to $p_k$ is made equal to zero, giving:

$x_{k+1} = x_k - H_f(x_k)^{-1} \, \nabla f(x_k)$   (A.264)
Essentially, the idea of Newton is to use curvature information for a more direct
path.
In practical terms, the use of the Hessian could be complicated. Moreover, matrix
inversions may cause numerical difficulties. In consequence, some alternatives have
been proposed to approximate the Hessian by a matrix, or even to directly approxi-
mate its inverse. This approximation is updated in each step, in order to improve it.
The methods that put into practice these ideas are called Quasi-Newton methods.
A typical approximation would be:

$f(x_k + p_k) \approx f(x_k) + \nabla f(x_k)^T p_k + \frac{1}{2} p_k^T B_k \, p_k$   (A.265)

where $B_k$ is a positive definite symmetric matrix.
Then:

$x_{k+1} = x_k - B_k^{-1} \, \nabla f(x_k)$   (A.266)
The matrix $B_k$ is chosen so that the direction of the step tends to approximate
Newton's step. In order to capture the curvature information, this is done according
to the following equation:

$B_{k+1} \, p_{k+1} = q_{k+1}$, with $p_{k+1} = x_{k+1} - x_k$, $q_{k+1} = \nabla f(x_{k+1}) - \nabla f(x_k)$

which is called the secant equation. Notice that it is a simple Taylor series approximation
of the gradient.
The secant equation can also be expressed compactly, dropping subscripts, as $B p = q$.
Except for the scalar case, the secant equation has many solutions. Taking advan-
tage of the degrees of freedom, and assuming that the Hessian would not be wildly
varying, the matrix Bk is chosen to be close to an initial matrix B0 that is normally
specified as a diagonal of suitable constants.
Suppose that the matrix $B_{k+1}$ is obtained by adding a correction term to $B_k$,
$B_{k+1} = B_k + C_k$; then the secant equation would be $(B_k + C_k)\, p_{k+1} = q_{k+1}$.
Hence:

$C_k \, p_{k+1} = q_{k+1} - B_k \, p_{k+1}$   (A.270)
A general family of corrections (dropping subscripts) is:

$C(\phi) = \frac{q q^T}{q^T p} - \frac{B p p^T B}{p^T B p} + \phi \, r \, v v^T$   (A.271)

with:

$v = \frac{q}{q^T p} - \frac{B p}{r}; \quad r = p^T B p$   (A.272)
The famous Broyden, Fletcher, Goldfarb and Shanno (BFGS) method is obtained
when taking $\phi = 0$.
If, instead of approximating the Hessian, one chooses to approximate its inverse
with a matrix $D_k$, the secant equation is written as:

$D_k \, q_k = p_k$   (A.273)

and the corresponding family of corrections is:

$G(\phi) = \frac{p p^T}{p^T q} - \frac{D q q^T D}{q^T D q} + \phi \, r \, w w^T$   (A.274)

with:

$w = \frac{p}{p^T q} - \frac{D q}{r}; \quad r = q^T D q$   (A.275)
By setting $\phi = 0$ one obtains the Davidon, Fletcher and Powell (DFP) method. And,
setting $\phi = 1$ one obtains the BFGS method (which is superior to DFP). The general
family of updates obtained with different values $0 \leq \phi \leq 1$ is the Broyden
family.
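The inverse update (A.274) can be checked numerically: for any value of $\phi$, the updated matrix satisfies the secant equation (A.273). A sketch (illustrative NumPy code; the function name and the random test vectors are assumptions):

```python
import numpy as np

def broyden_inverse_update(D, p, q, phi):
    """One Broyden-family update of the inverse-Hessian approximation D,
    following (A.274): phi = 0 gives DFP, phi = 1 gives BFGS."""
    r = q @ D @ q
    w = p / (p @ q) - (D @ q) / r
    return (D + np.outer(p, p) / (p @ q)
              - (D @ np.outer(q, q) @ D) / r
              + phi * r * np.outer(w, w))

rng = np.random.default_rng(1)
D = np.eye(3)
p = rng.normal(size=3)                 # step  x_{k+1} - x_k
q = rng.normal(size=3)                 # gradient change
for phi in (0.0, 0.5, 1.0):            # DFP, a Broyden mix, BFGS
    D_new = broyden_inverse_update(D, p, q, phi)
    assert np.allclose(D_new @ q, p)   # secant equation D q = p  (A.273)
print("secant equation holds for the whole Broyden family")
```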
When applied to quadratic functions, quasi-Newton methods result in conjugate
direction methods.
Some of the MATLAB Optimization Toolbox functions use BFGS, or a low-
memory variant called L-BFGS.
Reference [3] provides an overview of quasi-Newton methods with an extensive
bibliography. Likewise, it is easy to find on the Internet several interesting dissertations
on this topic. An alternative Newton-like step size selection has been proposed in
[116].
Suppose you are at $x_k$; the idea is to use an approximate model that tells you f(x) for
x near $x_k$. This model is trusted in a certain region. If the model fits well, this region
can be enlarged. Otherwise, if the model does not work well enough, the TR should be
reduced. The step direction is chosen inside the TR so as to minimize f(x).
Frequently, a quadratic model (similar to the Taylor expansion) is used:

$m(x_k + p) = f(x_k) + \nabla f(x_k)^T p + \frac{1}{2} p^T B_k \, p$   (A.276)
When the matrix Bk is chosen to be the Hessian, so Bk = H f (xk ), the algorithm
is called the trust-region Newton method.
In general, all that was assumed is that Bk is symmetric and uniformly bounded.
The locally constrained TR problem is:

$\min_{p \in T} m(x_k + p) = \min_{p \in T} \; f(x_k) + \nabla f(x_k)^T p + \frac{1}{2} p^T B_k \, p$   (A.277)

where T is the trust region. Usually this region is a ball with radius r:

$T(r) = \{x \;|\; \|x - x_k\| \leq r\}$   (A.278)
The result of the minimization would be a certain new point $x_+$. Now, some
conditions are stated to accept a step or not. Define the actual reduction as
$R_a = f(x_k) - f(x_+)$, and the predicted reduction as $R_p = m(x_k) - m(x_+)$.
Then, one takes into account the ratio $\rho = R_a / R_p$ and three thresholds
$0 \leq \eta_L < \eta_H$. Based on these parameters:
1. If $\rho < 0$ then the step is rejected
2. If $\rho < \eta_L$ then the radius of T should be decreased
3. If $\eta_L \leq \rho \leq \eta_H$ then accept the step
4. If $\rho > \eta_H$ then the radius of T should be increased
Typically, the radius is divided by 2 when it should be decreased, or multiplied
by 2 when it should be increased.
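The four rules can be sketched as a small helper (one consistent reading of the rules; the thresholds $\eta_L = 0.25$, $\eta_H = 0.75$ and the accept-but-shrink choice for small positive $\rho$ are illustrative assumptions):

```python
def trust_region_radius_update(ratio, radius, eta_low=0.25, eta_high=0.75):
    """Accept/reject logic for one trust-region step.
    ratio = actual reduction / predicted reduction.
    Returns (step_accepted, new_radius)."""
    if ratio < 0:
        return False, radius / 2          # model is misleading: reject, shrink
    if ratio < eta_low:
        return True, radius / 2           # poor fit: accept, but shrink
    if ratio <= eta_high:
        return True, radius               # reasonable fit: keep the radius
    return True, radius * 2               # very good fit: accept and grow

print(trust_region_radius_update(-0.1, 1.0))   # (False, 0.5)
print(trust_region_radius_update(0.9, 1.0))    # (True, 2.0)
```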
There are several alternatives for solving the minimization (A.277) [120]. One
of them is the trust-region-reflective method, which uses as trust region a two-
dimensional subspace V. A common choice of V is the linear space spanned by $v_1$,
the vector in the direction of the gradient $g_k$, and $v_2$, which is either the solution of:

$H_k \, v_2 = -g_k$   (A.281)

or a direction of negative curvature, with:

$v_2^T H_k \, v_2 < 0$   (A.282)
Cutting planes can be used in other contexts, not only in integer programming
problems. Actually, they have earned an important role in continuous variable
optimization methods, including non-smooth scenarios.
This part of the Appendix is based on [19], which contains the mathematical details
omitted in this succinct summary.
Given a feasible set F, our aim would be to determine a (small) subset X (called
the target set) such that the optimal solution belongs to X. The idea is to apply
a sequence of cuts to F, obtaining smaller and smaller regions that contain the
optimum.
The cutting plane method uses an oracle. When you query the oracle about a
certain point $x \in \mathbb{R}^n$, the oracle either tells you that $x \in X$, so a successful
result has been reached, or it gives you the parameters a, b of a separating hyperplane
between x and X:
$a^T z \leq b \;\; \text{for } z \in X, \quad \text{and} \quad a^T x \geq b$

In particular, when minimizing a convex differentiable objective $f_0$, any point z at
least as good as x satisfies:

$\nabla f_0(x)^T (z - x) \leq 0$   (A.283)
Given an infeasible point x, at least one of the constraints is violated; for
instance: $f_j(x) > 0$. Then, any feasible z satisfies:

$f_j(x) + \nabla f_j(x)^T (z - x) \leq 0$   (A.284)
One could start the process choosing a ball large enough to contain X. Then, the
oracle is queried at a series of points $x_1, x_2, \ldots, x_k$. The sequence stops if a point
belonging to X is found. Otherwise (the more likely outcome), one has obtained a series
of cutting planes, which can be written as:

$a_i^T z \leq b_i, \quad i = 1, 2, \ldots, k$   (A.285)
As said before, in many cases the application of sparse representation leads to opti-
mization issues. Typically optimization involves objective functions, to be maximised
or minimised, and a set of constraints (see the appendix on optimization topics for
fundamental concepts and methods).
It has been found that it is not always possible to have differentiable objective
functions. The article [46] includes a long section describing several sources of
non-differentiability in optimization problems; likewise, the book [103] includes, in
Chap. 5, an extensive reference of this kind of problems. Standard analytical methods
for optimization, which are based on derivatives, are difficult to apply in such cases.
This section is devoted to the treatment of non-differentiable optimization problems
[70]. Important concepts to be introduced are those of subgradient and subdifferential,
which extend, to a certain extent, the classical concepts of gradient and differential.
The section focuses on minimization problems.
Some initial examples would be helpful to illustrate why we are concerned with
non-differentiability.
A typical case, which is found in image processing, involves an objective function
like z = |x|, which is differentiable for all $x \neq 0$, but non-differentiable at
x = 0. Figure A.39 shows a plot of this function. It happens that the minimum of
the function takes place at x = 0.
A similar situation is depicted in Fig. A.40, where two curved components define
a convex function that is non-differentiable at x0 . It also happens that this is the point
where the function reaches its minimum.
Something has to be devised in order to keep, as much as possible, the traditional
methodology, which determines the minimum by equalling derivatives to zero. With
this in mind, a repertory of new concepts has been proposed. Let us now introduce
some of these concepts.
Considering in particular Fig. A.40, there are infinitely many lines that go
through the point $(x_0, f(x_0))$ and which are either touching or below the graph
of the function. The slope of any of these lines is a subderivative. The two lines,
$L_1$ and $L_2$, in Fig. A.40 are two examples of these lines.
In more formal terms, a subderivative at $x_0$ is a real number c such that:

$f(z) - f(x_0) \geq c \,(z - x_0), \quad \forall z$

The set [a, b] of all subderivatives is called the subdifferential of the function f(.)
at $x_0$. The subdifferential is denoted as $\partial f(x_0)$.
These concepts can be generalized to functions of several variables. A vector g is
called a subgradient of the function f(.) at $x_0$ if:

$f(z) \geq f(x_0) + g^T (z - x_0), \quad \forall z$
The book of Rockafellar [95], published in 1970 with the title Convex Analysis, provided
momentum for new theoretical developments. It was realized that many convex
optimization problems do not have a derivative at the optimum, and this motivated the
introduction of subdifferentials. The term nonsmooth analysis was introduced by
Clarke [30], 1983, who was the second student of Rockafellar and who extended the
study to locally Lipschitz functions. Important contributions extending the theory
further have been made by Mordukhovich [79, 80] in 2006.
The subgradient method iterates:

$x_{k+1} = x_k - \alpha_k \, g_k$   (A.289)

where $g_k$ is any subgradient of f(.) at $x_k$. These iterations do not necessarily
lead to a descent of the function value; for this reason one usually retains a memory
of the best point found along the iterations. The number of iterations could be high
(perhaps millions).
The convergence of the method has been the subject of much study. Some relevant
results are summarized in [16], starting from the assumption that there is a bound G
such that $\|g_k\|_2 \leq G$, which is for instance the case when f(.) satisfies the Lipschitz
condition:

$| f(u) - f(v) | \leq G \|u - v\|_2$   (A.290)
Several step size rules guarantee convergence. Square summable but not summable:

$\sum_k \alpha_k^2 < \infty, \quad \sum_k \alpha_k = \infty$   (A.291)

Diminishing step size:

$\lim_{k \to \infty} \alpha_k = 0, \quad \sum_k \alpha_k = \infty$   (A.292)

(typical example: $\alpha_k = a/\sqrt{k}$ with positive a).
And also, diminishing step length:

$\alpha_k = \gamma_k / \|g_k\|_2; \quad \lim_{k \to \infty} \gamma_k = 0; \quad \sum_k \gamma_k = \infty$   (A.293)
In these three alternatives the procedure converges to the optimum, that is, the gap
between the best value found and $f^*$ tends to 0 as $k \to \infty$.
When the optimal value $f^*$ is known, it is convenient to use Polyak's step
size, which is given by:

$\alpha_k = \frac{f(x_k) - f^*}{\|g_k\|_2^2}$   (A.294)
Indeed, there are some optimization problems for which $f^*$ is known, like for
example the feasibility problem of finding a point $x_c$ satisfying a particular system
of inequalities.
Some modifications of Polyak's step size have been proposed to circumvent
the use of $f^*$.
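A sketch of the subgradient method with Polyak's step size (A.294), keeping the best point found; the $\ell_1$ test function, with known $f^* = 0$, is an illustrative choice:

```python
import numpy as np

def subgradient_polyak(f, subgrad, f_star, x0, iters=500):
    """Subgradient method with Polyak's step size (A.294):
    alpha_k = (f(x_k) - f*) / ||g_k||^2.  The best point seen is kept,
    since subgradient steps need not decrease f."""
    x = np.asarray(x0, dtype=float)
    x_best, f_best = x.copy(), f(x)
    for _ in range(iters):
        g = subgrad(x)
        gap = f(x) - f_star
        if gap <= 0:                       # optimum reached exactly
            break
        x = x - (gap / (g @ g)) * g
        if f(x) < f_best:
            x_best, f_best = x.copy(), f(x)
    return x_best

# Example: minimize f(x) = ||x - a||_1, nondifferentiable at the optimum a
a = np.array([1.0, -2.0, 3.0])
f = lambda x: np.abs(x - a).sum()
subgrad = lambda x: np.sign(x - a)         # a valid subgradient of f
x_best = subgradient_polyak(f, subgrad, f_star=0.0, x0=np.zeros(3))
print(x_best)   # close to a
```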
The projected subgradient method is a simple extension of the basic method that
is suitable for the following convex constrained problem:
Minimize f(x)
Subject to $x \in C$
(C is a convex set)
The projected subgradient method is given by:

$x_{k+1} = P(x_k - \alpha_k \, g_k)$

where P is a projection on C.
The step sizes can be chosen as in the basic method, with similar convergence
properties.
For the non-negative orthant, the projection would be $P(x_i) = \max(x_i, 0)$.
For a linear system of constraints, $C = \{x : Ax = b\}$, the update would be:

$x_{k+1} = x_k - \alpha_k \,(I - A^T (A A^T)^{-1} A)\, g_k$   (A.297)
(if the current point is feasible one uses f 0 (.), if not one uses a subgradient of any
violated constraint).
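A sketch of the projected subgradient update (A.297) for an affine feasible set follows; projecting the subgradient onto the nullspace of A keeps every iterate feasible. The random data, the step size sequence and the $\ell_1$ objective are illustrative assumptions:

```python
import numpy as np

# Projected subgradient step for C = {x : Ax = b}, using (A.297):
# the subgradient is projected onto the nullspace of A, so feasibility
# is preserved at every iteration.
rng = np.random.default_rng(2)
A = rng.normal(size=(2, 5)); b = rng.normal(size=2)

P_null = np.eye(5) - A.T @ np.linalg.solve(A @ A.T, A)   # I - A'(AA')^{-1}A

x = np.linalg.lstsq(A, b, rcond=None)[0]   # a feasible starting point
for k in range(1, 201):
    g = np.sign(x)                         # subgradient of f(x) = ||x||_1
    x = x - (0.1 / k) * (P_null @ g)       # diminishing step a/k
    assert np.allclose(A @ x, b)           # iterates remain feasible

print("final l1 norm:", np.abs(x).sum())
```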
As said before, subgradients are not guaranteed to be directions of descent. Some
modifications have been proposed to detect good directions. One is the conjugated
subgradient method, which includes a minimization task (to find the best direction)
in each iteration. Another technique is called -subgradient method, which uses
approximated subgradients. See [100] for details.
There are reasons, like for instance the vastness of the optimization problem
at hand or its inherent randomness, which can recommend the use of stochastic
subgradient methods [17, 117]. This method is almost the same as the subgradient
method, but using noisy subgradients and a reduced set of step size alternatives.
It is said that g is a noisy subgradient of f(.) at x (which is random) if $E(g \,|\, x) \in \partial f(x)$
(where E(.) is the expected value). One could consider the noisy subgradient
as the sum of a true subgradient and a zero-mean random variable.
The stochastic subgradient iteration is:

$x_{k+1} = x_k - \alpha_k \, g_k$   (A.299)

or, component-wise:

$(x_j)_{k+1} = (x_j)_k - \alpha_k \,(g_j)_k$   (A.300)
Bundle methods collect the information gathered along the iterations:

$B_n = \{(x_k, f(x_k), g_k), \;\; k = 1, 2, \ldots, n\}$
Let us illustrate how the piecewise model is built. Figure A.42 shows a convex
objective function f (.) 0. The problem is to find its minimum. A first point x1 has
been chosen, and a first model of the objective function has been found: the line L 1 .
The next point, x2 , to be added to the exploration is simply found at the intersection
of L 1 with the horizontal axis.
Taking $x_2$ as the new point for a linear approximation, which is plotted as the line
$L_2$, it is possible to determine the next point to be added, $x_3$. Now the model consists
of two linear pieces (a V), taken from the two lines. Figure A.43 depicts this step.
Figure A.44 shows the result of a third step. Notice that now the next point to
be added, $x_4$, is found at the intersection of lines $L_2$ and $L_3$. The piecewise model,
made of segments of $L_1$, $L_2$, and $L_3$, is getting better.
Along the modelling iterations, the cutting plane method monitors a gap quantity
(the difference between the best function value found and the minimum of the current
piecewise model), which monotonically decreases. Once this gap becomes smaller
than a certain specified threshold, the iterative process terminates.
A series of drawbacks of the basic cutting plane scheme has been reported in the
scientific literature. Therefore, several improvements have been proposed [69, 103].
The article [68] offers a modern perspective and opportune references. A fairly
recent survey can be found in [75].
Suppose that the function to be minimized has several points where it is not
differentiable, and so the plot of the function has kinks (informally speaking) where
the derivative does not exist. A classical way of smoothing such a function is the
Moreau envelope:

$M_{\lambda f}(x) = \inf_u \left( f(u) + \frac{1}{2 \lambda} \|u - x\|_2^2 \right)$   (A.306)

(Moreau employed $\lambda = 1$).
Directly related to this envelope, the proximal mapping is defined by:

$Prox_f(x) = \arg \min_u \left( f(u) + \frac{1}{2} \|u - x\|_2^2 \right)$   (A.307)

The gradient of the envelope can be expressed through the proximal mapping:

$\nabla M_{\lambda f}(x) = \frac{1}{\lambda} \left( x - Prox_{\lambda f}(x) \right)$   (A.309)
Also, if f(x) is convex, proper and lower semicontinuous, the function inside the
parentheses in (A.307) is strongly convex because of the added Euclidean norm term.
Note the connection of (A.306) and (A.307) with aspects related to gradient
descent in the appendix on optimization.
Appendix A: Selected Topics of Mathematical Optimization 347
Moreau introduced the envelope and the proximal mapping in 1962 [81] as a way
of regularizing and approximating a convex function. Actually it was called prox-
regularization, and it can be regarded as analogous to Tikhonov regularization
[54]. The book [98] places the envelope and the proximal mapping in a rigorous
mathematical context.
The Moreau envelope is an example of infimal convolution. The infimal convolution
of closed proper convex functions f(x) and g(x) is defined as:

$(f \,\square\, g)(x) = \inf_u \left( f(u) + g(x - u) \right)$
Penalty methods minimize, at each iteration k, a function of the form $f(x) + p_k(x)$
to get a minimizer $x_k$.
The functions $p_k(x)$ are chosen so that the sequence $\{x_k\}$ converges to a
solution of the f(x) minimization problem.
In the context of constrained minimization, where a feasible set exists, one could
distinguish two approaches. Barrier-function methods (for instance, interior-point
methods) require the minimizers {xk } to belong to the feasible set. This is not the
case with penalty-function methods, where constraint violations are discouraged but
not prohibited (hence, sometimes they are called exterior-point methods).
Some examples of penalty functions, like the absolute-value, the Courant-
Beltrami, the quadratic-loss, etc., can be found in [24]. Also in [24], connections
of penalty-function methods with the minimization of cross-entropy, regularized
least-squares, and the Lagrangian approach in optimization are described.
As highlighted in [24], the proximal approach can be used for a sequential mini-
mization algorithm (this will be introduced later on).
Some examples of envelopes and proximal maps
A first and important example is f(x) = |x|. Its Moreau envelope is:

$M_{\lambda f}(x) = \begin{cases} \frac{1}{2 \lambda} x^2, & x \in [-\lambda, \lambda] \\ |x| - \frac{\lambda}{2}, & \text{otherwise} \end{cases}$   (A.312)

and its proximal mapping is the soft-thresholding operator:

$Prox_{\lambda f}(x) = \begin{cases} 0, & x \in [-\lambda, \lambda] \\ x - \lambda, & x > \lambda \\ x + \lambda, & x < -\lambda \end{cases}$   (A.313)
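Both formulas are easy to verify numerically; the sketch below also checks the envelope value against a direct grid minimization of the definition (illustrative NumPy code, with assumed function names):

```python
import numpy as np

def prox_abs(x, lam):
    """Proximal mapping of f(x) = |x| scaled by lam (soft thresholding, (A.313))."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def moreau_env_abs(x, lam):
    """Moreau envelope of |x| (the Huber-like function of (A.312))."""
    return np.where(np.abs(x) <= lam,
                    x ** 2 / (2 * lam),
                    np.abs(x) - lam / 2)

# Check the envelope against its definition, minimizing over a fine grid of u:
x, lam = 2.0, 0.5
u = np.linspace(-5, 5, 200001)
env_numeric = np.min(np.abs(u) + (u - x) ** 2 / (2 * lam))
assert abs(env_numeric - moreau_env_abs(x, lam)) < 1e-6
print(prox_abs(np.array([-2.0, 0.3, 2.0]), 0.5))   # [-1.5, 0., 1.5]
```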
If f(x) is the indicator function of a convex set C,
then $Prox_f(x) = \arg \min_{u \in C} \|u - x\|_2^2 = P_C(x)$, which is the projection over the set C.
And, finally, the case of the Euclidean norm, $f(x) = \|x\|_2$. Its proximal mapping
is:

$Prox_{\lambda f}(x) = \begin{cases} \left(1 - \frac{\lambda}{\|x\|_2}\right) x, & \|x\|_2 \geq \lambda \\ 0, & \text{otherwise} \end{cases}$   (A.318)
A minimizer $x^*$ of f(x) is characterized as a fixed point of the proximal mapping:

$x^* = Prox_f(x^*)$   (A.321)
Thus, the proximal mapping of f (x, y) is done with the proximal mapping of each
of the separable parts. In the general case of a fully separable f (x), the proximal
mapping would consist of proximal mappings of scalar functions.
The proximal point algorithm
As announced before, when dealing with penalty functions, there is a sequential
minimization algorithm based on the proximal mapping. The algorithm is simply
expressed as follows:

$x_{k+1} = Prox_{\lambda_k f}(x_k)$
The function g is called the proximal average of the set of functions. Not to be
confused with the concept of averaged operator.
Let us enter into another aspect. The already cited article [97] connects monotone
operators and the proximal mapping. In the case of functions, the concept of
monotonicity is clear: a monotonically increasing function f(x) would never
decrease, and vice-versa. In the case of operators, it is said that a set-valued
operator M (mapping x to a set M(x)) is monotone if:

$(u - v)^T (x - y) \geq 0, \quad \forall x, y, \;\; u \in M(x), \; v \in M(y)$

The resolvent of the operator M is:

$R = (I + \lambda M)^{-1}$   (A.328)
with A and B monotone, A(x) single valued, and B with easy to compute resolvent.
An important algorithm is the following:
The problem considered is the minimization of g(x) + h(x), where g(x) and h(x) are
closed, proper and convex, and g(x) is differentiable. The function h(x) can be
extended-valued, so $h(x): \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$.
The proximal gradient method for the minimization problem is:

$x_{k+1} = Prox_{\lambda_k h}\left( x_k - \lambda_k \nabla g(x_k) \right)$

When $\nabla g(x)$ is Lipschitz with constant L, the method converges with rate O(1/k)
if $\lambda_k \in (0, 1/L)$. If L is not known, the step size $\lambda_k$ can be found by line search [88].
The iterative method can be interpreted as a fixed point iteration. A point $x^*$ is a
solution of the minimization problem iff:

$x^* = (I + \lambda \, \partial h)^{-1} (I - \lambda \nabla g)(x^*)$   (A.337)

or equivalently:

$x^* = Prox_{\lambda h}\left( x^* - \lambda \nabla g(x^*) \right)$   (A.338)
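A classical instance is ISTA for the lasso problem, with $g(x) = \frac{1}{2}\|Ax - b\|^2$ (smooth) and $h(x) = \lambda \|x\|_1$ (prox-friendly). A minimal sketch (the function names and the tiny test problem are illustrative; with A = I the iteration reaches the soft-thresholding solution immediately):

```python
import numpy as np

def prox_l1(x, t):
    """Prox of t*||x||_1 (soft thresholding)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def ista(A, b, lam, iters=2000):
    """Proximal gradient (ISTA) for min 0.5||Ax-b||^2 + lam*||x||_1,
    with step size 1/L, L the Lipschitz constant of the smooth gradient."""
    L = np.linalg.norm(A, 2) ** 2          # largest eigenvalue of A'A
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        grad = A.T @ (A @ x - b)           # gradient of the smooth part
        x = prox_l1(x - grad / L, lam / L) # forward step, then prox
    return x

# Tiny example: A = I, so the solution is soft thresholding of b
A = np.eye(3)
b = np.array([3.0, 0.2, -1.0])
x_l1 = ista(A, b, lam=0.5)
print(x_l1)   # close to [2.5, 0, -0.5]
```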
where $\hat{g}(x, y)$ is a convex upper bound to g(x), satisfying $\hat{g}(x, y) \geq g(x)$ and
$\hat{g}(x, x) = g(x)$.
This generic algorithm iteratively exerts a majorization, with the upper bound,
and then minimizes the majorization.
An example of upper bound could be the following:

$\hat{g}(x, y) = g(y) + \nabla g(y)^T (x - y) + \frac{1}{2 \lambda} \|x - y\|_2^2$   (A.340)
Consider now the function:

$\hat{g}(x, y) + h(x)$

This function is a surrogate for g(x) + h(x), with fixed y. It can be shown that the
majorization-minimization iteration reduces to:

$x_{k+1} = \arg \min_x \; \frac{1}{2 \lambda} \left\| x - \left( x_k - \lambda \nabla g(x_k) \right) \right\|_2^2 + h(x)$   (A.343)

Therefore, the solution tries to balance the gradient step with the weighted contribution
of h(x).
Nesterov [82] introduced an accelerated proximal gradient method, with the
following scheme:

$y_{k+1} = x_k + w_k (x_k - x_{k-1})$   (A.344)

$x_{k+1} = Prox_{\lambda h}\left( y_{k+1} - \lambda \nabla g(y_{k+1}) \right)$

When $\nabla g(x)$ is Lipschitz with constant L, the accelerated method converges with
rate $O(1/k^2)$ if $\lambda_k \in (0, 1/L)$. Notice the difference with the standard method,
which converges with rate O(1/k). Also like before, if L is not known, the step size
$\lambda_k$ can be found by line search.
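A sketch of the accelerated scheme applied to a lasso-type problem (a FISTA-style implementation; the momentum sequence follows (A.254)/(A.344), and the random test problem and iteration count are illustrative assumptions):

```python
import numpy as np

def fista(A, b, lam, iters=4000):
    """Accelerated proximal gradient for min 0.5||Ax-b||^2 + lam*||x||_1,
    using the momentum scheme (A.344) with the t_k sequence of (A.254)."""
    L = np.linalg.norm(A, 2) ** 2
    prox = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t, 0.0)
    x = x_prev = np.zeros(A.shape[1])
    t = 1.0
    for _ in range(iters):
        t_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        y = x + ((t - 1.0) / t_next) * (x - x_prev)   # momentum step (A.344)
        grad = A.T @ (A @ y - b)
        x_prev, x = x, prox(y - grad / L, lam / L)    # proximal gradient step
        t = t_next
    return x

rng = np.random.default_rng(3)
A = rng.normal(size=(20, 10))
x_true = np.zeros(10); x_true[[1, 5]] = [2.0, -3.0]   # sparse ground truth
b = A @ x_true
x_hat = fista(A, b, lam=0.01)
print(np.round(x_hat, 2))   # approximately recovers x_true (small lasso bias)
```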
An extensive treatise on monotone operators and proximal methods can be found
in [10].
For more material on non-smooth optimization, it is recommended to visit
the web page of N. Karmitsa (see the Resources section).
The MATLAB Optimization Toolbox (OpT) provides functions for most of the opti-
mization aspects described in this Appendix.
By using the optimtool( ) function, MATLAB offers access via a GUI to four
general categories of optimization solvers:
Minimizers, including:
Unconstrained optimization
Linear programming
Quadratic programming
Nonlinear programming
In accord with the topics considered in this Appendix, our first comments are
devoted to linear and quadratic programming, and then to other optimization prob-
lems. See [41] for a MATLAB OpT tutorial.
The function for linear programming is linprog( ), which implements three types of
algorithms:
A simplex algorithm
An active-set algorithm
A primal-dual interior point method
Quadratic programming problems can be tackled with the function quadprog( ). The
function includes two types of algorithms:
Large scale algorithms use sparse representations, so they do not need large
computational resources. In contrast, medium scale algorithms use full matrices
and may require large computational efforts, taking perhaps a long time. You can
use large scale algorithms on a small problem. However, the medium scale
algorithms offer extra functionality, such as more types of constraints, and maybe
better performance.
MATLAB offers two alternatives for solving this kind of problem: to use
fminsearch( ) or to use fminunc( ).
The function fminsearch( ) is based on the Nelder-Mead simplex algorithm. Here
the word simplex denotes a polytope with n + 1 vertices (so it has nothing to do
with the simplex method employed for linear programming). The algorithm starts
with an initial simplex, and then modifies the simplex repeatedly, moving it and
getting it smaller and smaller until convergence to the solution.
The function fminunc( ) has two algorithm options. If you have information on the
gradient, use the trust-region algorithm; otherwise, use the quasi-newton algo-
rithm.
Before using any of these functions for unconstrained optimization, you can use
the function optimset( ) to specify options, like for instance to specify the chosen
algorithm when using fminunc( ).
If one opens the solver menu, a series of functions can be accessed, as shown
in Fig. A.48. Some of them, those beginning with lsq, are devoted to least squares.
Most of the others have been already described.
A.11 Resources
A.11.1 MATLAB
A.11.1.1 Toolboxes
CVX:
http://www.stanford.edu/~boyd/software.html
YALMIP:
http://control.ee.ethz.ch/index.cgi?action=details&id=2088&
MOSEK:
http://docs.mosek.com/7.0/toolbox/
UNLocBoX:
https://lts2.epfl.ch/unlocbox/
OPTI Toolbox:
http://www.i2c2.aut.ac.nz/Wiki/OPTI/index.php/Main/WhatIsOPTI?
TOMLAB:
http://tomopt.com/tomlab/about/
Stephen P. Boyd:
http://www.stanford.edu/~boyd/software.html
LIPSOL:
www.caam.rice.edu/~zhang/lipsol/
Guanghui Lan:
http://www.ise.ufl.edu/glan/computer-codes/
Stanford.edu:
http://web.stanford.edu/~yyye/matlab.html
NCSU.edu:
http://www4.ncsu.edu/~ctk/matlab_darts.html
SDPT3:
http://www.math.nus.edu.sg/~mattohkc/sdpt3.html
Mark Schmidt:
https://www.cs.ubc.ca/~schmidtm/Software/code.html
Optimization in Practice with MATLAB (book):
http://www.cambridge.org/us/academic/subjects/engineering/control-systems-and-optimization/
A.11.2 Internet
References
1. I. Adler, The equivalence of linear programs and zero-sum games. Int. J. Game
Theory 42(1), 165–177 (2013)
2. A.A. Ahmadi, A. Olshevsky, P.A. Parrilo, J.N. Tsitsiklis, NP-hardness of
deciding convexity of quartic polynomials and related problems. Math. Pro-
gramm. 137(1–2), 453–476 (2013), series A
3. M. Al-Baali, H. Khalfan, An overview of some practical quasi-Newton methods
for unconstrained optimization. SQU J. Sci. 12(2), 199–209 (2007)
4. F. Alizadeh, Optimization over the positive-definite cone: interior point meth-
ods and combinatorial applications, in Advances in Optimization and Parallel
Computing, ed. by P.M. Pardalos (North-Holland, 1992)
5. F. Alizadeh, D. Goldfarb, Second-order cone programming. Math. Programm.
95(1), 3–51 (2003)
6. E.D. Andersen, How to use Farkas' lemma to say something important about
linear infeasible problems. MOSEK Technical Report TR-2011-1, 2011
7. Y. Bai, M.E. Ghami, C. Roos, A comparative study of kernel functions for
primal-dual interior-point algorithms in linear optimization. SIAM J. Optim.
15(1), 101–128 (2004)
8. T. Baran, D. Wei, A.V. Oppenheim, Linear programming algorithms for sparse
filter design. IEEE Trans. Signal Process. 58(3), 1605–1617 (2010)
9. J. Barzilai, J.M. Borwein, Two point step size gradient methods. IMA J. Numer.
Anal. 8, 141–148 (1988)
10. H.H. Bauschke, P.L. Combettes, Convex Analysis and Monotone Operator
Theory (Springer Verlag, 2010)
11. A. Beck, M. Teboulle, Gradient-based algorithms with applications to signal
recovery, in Convex Optimization in Signal Processing and Communications,
pp. 42–88 (2009)
12. R. Bellman, K. Fan, On systems of linear inequalities in Hermitian matrix
variables. Proc. Symp. Pure Math. VII, 1–11 (Amer. Math. Soc., Providence, 1963)
13. P. Belotti, Optimization Models and Applications Clemson University, IE 426
Lecture Notes, Lecture 20, 2009. www.myweb.clemson.edu/~pbelott/bulk/
teaching/lehigh/ie426-f09/lecture20.pdf
14. D.P. Bertsekas, Convex Optimization Theory (Athena Scientific, 2014) supple-
ment. http://www-mit.mit.edu/dimitrib/www/convexdualitychapter.pdf
15. K.C. Border, Separating Hyperplane Theorems (Caltech, 2010). http://www.
hss.caltech.edu/~kcb/Notes/SeparatingHyperplane.pdf
16. S. Boyd, A. Mutapcic, Subgradient Methods (Stanford University, Lecture
Notes for EE364b, 2007). https://web.stanford.edu/class/ee364b/lectures/
subgrad_method_notes.pdf
17. S. Boyd, A. Mutapcic, Stochastic Subgradient Methods (Stanford Univer-
sity, Lecture Notes for EE364b, 2008). https://web.stanford.edu/class/ee364b/
lectures/stoch_subgrad_notes.pdf.
18. S. Boyd, L. Vandenberghe, Convex Optimization (Cambridge University Press,
2004)
19. S. Boyd, L. Vandenberghe, Localization and Cutting Plane Methods (Stanford
University, Lecture Notes, 2007). www.stanford.edu/class/ee392o/localization
-methods.
20. S. Boyd, L. Vandenberghe, Interior-point Methods (2008). https://web.stanford.
edu/class/ee364a/lectures/barrier.pdf
21. S. Boyd, L. Vandenberghe, Subgradients (Stanford University, Lecture Notes
for EE364b, 2008). https://web.stanford.edu/class/ee364b/lectures/
subgradients_notes.pdf
22. S. Boyd, L. Vandenberghe, J. Skaf, Analytic Center Cutting-plane Method
(Stanford University, 2008). http://112.112.8.207/resource/data/20100601/U/
stanford201001010/06-accpm_notes.pdf
23. S. Burer, Copositive programming, in Handbook on Semidefinite, Conic and
Polynomial Optimization, pp. 201218 (Springer USA, 2012.)
360 Appendix A: Selected Topics of Mathematical Optimization
The example of the two-tank system was chosen. Figure B.1 depicts the outputs of
the system, which are the measurements of the tank heights.
Figure B.2 compares the evolution of the 2-variable system, in continuous curves,
with the state estimation yielded by the Kalman filter, depicted with x-marks.
Figure B.3 shows the evolution of the estimation error.
Figure B.4 shows the evolution of the Kalman filter gains.
Figure B.5 shows the evolution of the a priori state covariance.
Figure B.6 shows the evolution of the estimated state covariance.
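The prediction/update recursion that produces these results can be sketched, for a scalar system, as follows (a minimal Python sketch with illustrative constants a, c, sw, sv; the actual program uses the 2-state tank model with matrix arithmetic):

```python
# Minimal scalar Kalman filter: prediction and update steps.
# All constants here are illustrative assumptions, not the book's values.
a, c = 0.95, 1.0          # state transition and measurement gains
sw, sv = 1e-3, 6e-4       # process and measurement noise variances

def kalman_step(xe, P, ym, u=0.0, b=0.0):
    # Prediction: a priori state and covariance
    xa = a * xe + b * u
    M = a * P * a + sw
    # Update: Kalman gain, a posteriori covariance and state
    K = M * c / (c * M * c + sv)
    P = M - K * c * M
    xe = xa + K * (ym - c * xa)
    return xe, P

xe, P = 0.5, 1.0          # initial estimate and covariance
for ym in [0.9, 0.88, 0.91, 0.9]:
    xe, P = kalman_step(xe, P, ym)
```

The matrix version in the program below follows the same two-step pattern, with M and P as 2x2 covariance matrices.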
Appendix B: Long Programs 369
aa=[0 40 0 0.0018];
subplot(2,2,1); plot(shiftdim(M(1,1,:)),'k');
axis(aa);title('M11');%plots M11
subplot(2,2,2); plot(shiftdim(M(1,2,:)),'k');
axis(aa);title('M12');%plots M12
subplot(2,2,3); plot(shiftdim(M(2,1,:)),'k');
axis(aa);title('M21');%plots M21
subplot(2,2,4); plot(shiftdim(M(2,2,:)),'k');
axis(aa);title('M22'); %plots M22
xlabel('sampling periods');
% display of P(n) evolution
figure(6)
aa=[0 40 0 0.0006];
subplot(2,2,1); plot(shiftdim(P(1,1,:)),'k');
axis(aa);title('P11');%plots P11
subplot(2,2,2); plot(shiftdim(P(1,2,:)),'k');
axis(aa);title('P12');%plots P12
subplot(2,2,3); plot(shiftdim(P(2,1,:)),'k');
axis(aa);title('P21');%plots P21
subplot(2,2,4); plot(shiftdim(P(2,2,:)),'k');
axis(aa);title('P22'); %plots P22
xlabel('sampling periods');
Fig. B.7 The prediction step, from left to right (Fig. 1.11)
(Figure: contour plots of the measurement noise Sv and of the estimated y, on the x2-x1 plane)
+(((px2(nj)-mu2).^2)/C(2,2));
Swpdf(ni,nj)= K*exp(-Q*aux1);
end;
end;
%the aP PDF (P(n))
mu1=xe1(ns1); mu2=xe2(ns1);
C=P(:,:,ns1); D=det(C);
K=1/(2*pi*sqrt(D)); Q=(C(1,1)*C(2,2))/(2*D);
aPpdf=zeros(pN1,pN2); %space for the PDF
for ni=1:pN1,
for nj=1:pN2,
aux1=(((px1(ni)-mu1)^2)/C(1,1))...
+(((px2(nj)-mu2).^2)/C(2,2))...
-((2*C(1,2)*(px1(ni)-mu1).*(px2(nj)-mu2))/(C(1,1)*C(2,2))); %cross term
aPpdf(ni,nj)= K*exp(-Q*aux1);
end;
end;
%the M PDF
mu1=rxa(1,ns2); mu2=rxa(2,ns2);
C=M(:,:,ns2); D=det(C);
K=1/(2*pi*sqrt(D)); Q=(C(1,1)*C(2,2))/(2*D);
Mpdf=zeros(pN1,pN2); %space for the PDF
for ni=1:pN1,
for nj=1:pN2,
aux1=(((px1(ni)-mu1)^2)/C(1,1))...
+(((px2(nj)-mu2).^2)/C(2,2))...
-((2*C(1,2)*(px1(ni)-mu1).*(px2(nj)-mu2))/(C(1,1)*C(2,2))); %cross term
Mpdf(ni,nj)= K*exp(-Q*aux1);
end;
end;
%the Sv PDF
mu1=0; mu2=0;
C=Sv; D=det(C);
K=1/(2*pi*sqrt(D)); Q=(C(1,1)*C(2,2))/(2*D);
Svpdf=zeros(pN1,pN2); %space for the PDF
for ni=1:pN1,
for nj=1:pN2,
aux1=(((px1(ni)-mu1)^2)/C(1,1))...
+(((px2(nj)-mu2).^2)/C(2,2));
Svpdf(ni,nj)= K*exp(-Q*aux1);
end;
end;
%the eY PDF (estimated y)
ya=rya(:,ns2); %estimated y
mu1=ya(1); mu2=ya(2);
C=cY(:,:,ns2); D=det(C);
K=1/(2*pi*sqrt(D)); Q=(C(1,1)*C(2,2))/(2*D);
eYpdf=zeros(pN1,pN2); %space for the PDF
for ni=1:pN1,
for nj=1:pN2,
aux1=(((px1(ni)-mu1)^2)/C(1,1))+(((px2(nj)-mu2).^2)/C(2,2));
eYpdf(ni,nj)= K*exp(-Q*aux1);
end;
end;
%the P PDF (P(n+1))
mu1=xe1(ns2); mu2=xe2(ns2);
C=P(:,:,ns2); D=det(C);
K=1/(2*pi*sqrt(D)); Q=(C(1,1)*C(2,2))/(2*D);
Ppdf=zeros(pN1,pN2); %space for the PDF
for ni=1:pN1,
for nj=1:pN2,
aux1=(((px1(ni)-mu1)^2)/C(1,1))...
+(((px2(nj)-mu2).^2)/C(2,2))...
-((2*C(1,2)*(px1(ni)-mu1).*(px2(nj)-mu2))/(C(1,1)*C(2,2))); %cross term
Ppdf(ni,nj)= K*exp(-Q*aux1);
end;
end;
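Each of the loops above evaluates the same bivariate Gaussian density. Written directly from the inverse of the 2x2 covariance, with the cross term spelled out, the computation reads (a Python sketch; the function name gauss2 and the covariance values are illustrative assumptions):

```python
import math

def gauss2(x1, x2, mu1, mu2, C):
    # bivariate Gaussian PDF, with the 2x2 covariance inverted explicitly
    d = C[0][0] * C[1][1] - C[0][1] * C[1][0]      # determinant
    z1, z2 = x1 - mu1, x2 - mu2
    # quadratic form z' * inv(C) * z, cross term included
    q = (C[1][1] * z1 * z1 - 2.0 * C[0][1] * z1 * z2 + C[0][0] * z2 * z2) / d
    return math.exp(-0.5 * q) / (2.0 * math.pi * math.sqrt(d))

C = [[2.0, 0.5], [0.5, 1.0]]             # example covariance (assumed values)
peak = gauss2(0.0, 0.0, 0.0, 0.0, C)     # the maximum, at the mean
```

With a diagonal covariance the cross term vanishes and the density factorizes, which is the case handled by the Sw and Sv blocks above.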
%---------------------------------------------------
%display
figure(1)
subplot(1,2,1)
contour(px2,px1,Swpdf); hold on;
contour(px2,px1,aPpdf);
xlabel('x2'); ylabel('x1');
title('<process noise>, <xe(n)>');
subplot(1,2,2)
contour(px2,px1,Mpdf); hold on;
xlabel('x2'); ylabel('x1');
title('<xa(n+1)>');
figure(2)
subplot(1,2,1)
contour(px2,px1,Svpdf); hold on;
contour(px2,px1,Mpdf);
xlabel('x2'); ylabel('x1');
title('<measurement noise>, <xa(n+1)>');
subplot(1,2,2)
contour(px2,px1,eYpdf); hold on;
plot(rym(2,ns1),rym(1,ns1),'r*','MarkerSize',12);
xlabel('x2'); ylabel('x1');
title('<estimated y>');
figure(3)
subplot(1,3,1)
contour(px2,px1,Mpdf);
xlabel('x2'); ylabel('x1');
title('<xa(n+1)>');
subplot(1,3,2)
contour(px2,px1,eYpdf);
xlabel('x2'); ylabel('x1');
title('<estimated y>');
subplot(1,3,3)
contour(px2,px1,Ppdf);
xlabel('x2'); ylabel('x1');
title('<xe(n+1)>');
Figure B.10 shows a possible situation, in which the original PDF, with σ = 2, is
shifted 0.8 to the right.
(Fig. B.10: histograms of the PDF before and after the nonlinear measurement)
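The behaviour of a PDF pushed through a nonlinearity can be reproduced by transforming Gaussian samples directly; in this sketch a sine stands in for the actual measurement function of the program (an assumption), and the result shows why the mean after the nonlinearity differs from the nonlinearity applied to the mean:

```python
import math, random

# Gaussian samples with sigma = 2, shifted 0.8, through y = sin(x)
random.seed(1)
x = [0.8 + random.gauss(0.0, 2.0) for _ in range(5000)]
y = [math.sin(v) for v in x]          # "after" the nonlinear measurement
my = sum(y) / len(y)                  # sample mean after the nonlinearity
# with a wide input PDF, my is far from sin(mean(x)):
# this is the difficulty addressed by the EKF, the UT, and particle filters
```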
Figures B.11 and B.12 show the evolution of state variables, measurements, and air
drag during the fall under noisy conditions.
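The falling-body model used in these programs (altitude and velocity integrated with sampling period T, and drag computed from an exponential air-density profile) can be sketched as follows; all constants here are illustrative assumptions, not the book's values:

```python
import math

T = 0.5                      # sampling period (assumed)
g = -9.81                    # gravity, m/s^2 (assumed; the book may use other units)
rho0, k = 1.2, 9000.0        # sea-level air density and scale height (assumed)
beta = 2000.0                # ballistic constant dividing the drag (assumed)

alt, vel = 100000.0, 0.0     # initial altitude and velocity
for _ in range(80):          # 40 seconds of fall
    rho = rho0 * math.exp(-alt / k)          # air density at current altitude
    drag = rho * vel * vel / (2.0 * beta)    # drag opposes the fall
    alt = alt + vel * T                      # Euler integration, as in the programs
    vel = vel + (g + drag) * T
```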
Fig. B.11 System states (cross marks) under noisy conditions (Fig. 1.27)
(Fig. B.12: the distance measurement and the drag, along seconds)
title('altitude'); xlabel('seconds')
axis([0 Nf*T 0 12*10^4]);
subplot(1,2,2)
plot(tim,rx(2,1:Nf),'kx');
title('velocity'); xlabel('seconds');
axis([0 Nf*T -6000 1000]);
figure(2)
subplot(1,2,1)
plot(tim,ry(1:Nf),'k');
title('distance measurement');
xlabel('seconds');
axis([0 Nf*T 0 12*10^4]);
subplot(1,2,2)
plot(tim,rd(1:Nf),'k');
title('drag');
xlabel('seconds');
axis([0 Nf*T 0 1000]);
The next figures show the EKF results for the falling body in noisy conditions
(Figs. B.13, B.14, B.15, B.16, B.17 and B.18).
Fig. B.13 System states (cross marks), and states estimated by the EKF (continuous) (Fig. 1.29)
(Fig. B.14: the altitude and velocity estimation errors, along seconds)
(Fig. B.17: evolution of the covariances P11, P12, P21 and P22, along sampling periods)
Fig. B.18 Evolution of the altitude and velocity Kalman gains (Fig. 1.32)
subplot(1,3,3)
plot(tim,rf2(3,1:Nf),'k');
title('jacobian term f23');
xlabel('seconds');
figure(5)
subplot(1,2,1)
plot(tim,K(1,1:Nf),'k');
title('Kalman gain, altitude');
xlabel('seconds');
axis([0 Nf*T 0 0.12]);
subplot(1,2,2)
plot(tim,K(2,1:Nf),'k');
title('Kalman gain: velocity');
xlabel('seconds');
axis([0 Nf*T -0.003 0.001]);
figure(6)
% display of P(n) evolution
subplot(2,2,1); plot(tim,shiftdim(P(1,1,1:Nf)),'k');
title('P11');%plots P11
subplot(2,2,2); plot(tim,shiftdim(P(1,2,1:Nf)),'k');
title('P12');%plots P12
subplot(2,2,3); plot(tim,shiftdim(P(2,1,1:Nf)),'k');
title('P21');%plots P21
subplot(2,2,4); plot(tim,shiftdim(P(2,2,1:Nf)),'k');
title('P22'); %plots P22
xlabel('sampling periods');
With the propagated sigma points it is possible to compute the mean and variance of
the propagated data (Fig. B.19).
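For a scalar variable, the generation of sigma points, their propagation, and the weighted mean and variance can be sketched as follows (standard UT weights; the scaling constant lam is an assumption):

```python
import math

def ut_scalar(f, mu, P, lam=1.0):
    # sigma points for a scalar variable: the mean, and mean +/- sqrt((n+lam)P)
    n = 1
    s = math.sqrt((n + lam) * P)
    pts = [mu, mu + s, mu - s]
    w0 = lam / (n + lam)               # weight of the central point
    wi = 1.0 / (2.0 * (n + lam))       # weight of the side points
    ys = [f(p) for p in pts]           # propagate through the nonlinearity
    ym = w0 * ys[0] + wi * (ys[1] + ys[2])
    Py = w0 * (ys[0] - ym) ** 2 + wi * ((ys[1] - ym) ** 2 + (ys[2] - ym) ** 2)
    return ym, Py

# for a linear function the UT is exact: mean 2*mu + 1, variance 4*P
ym, Py = ut_scalar(lambda x: 2.0 * x + 1.0, 3.0, 0.5)
```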
(Fig. B.19: the PDF before and after the nonlinear measurement)
Fig. B.21 The uncertainty on the Cartesian plane, and sigma points (Fig. 1.40)
Fig. B.22 The UT approximation and propagated data points (Fig. 1.41)
aux=zeros(2,2); aux1=zeros(2,2);
for nn=1:4,
aux=aux+(ys(:,nn)-ya(:))*(ys(:,nn)-ya(:))';
end;
aux=aux/(2*lN);
aux1=(ys0-ya)*(ys0-ya)';
Py=aux+(aaN*aux1);
%the Gaussian PDF approximation
C=Py;
mu1=ya(1); mu2=ya(2);
D=det(C);
K=1/(2*pi*sqrt(D)); Q=(C(1,1)*C(2,2))/(2*D);
x1=-700:10:700;
x2=0:10:900;
pN1=length(x1); pN2=length(x2);
yp=zeros(pN1,pN2); %space for the PDF
for ni=1:pN1,
for nj=1:pN2,
aux1=(((x1(ni)-mu1)^2)/C(1,1))+(((x2(nj)-mu2).^2)/C(2,2));
yp(ni,nj)= K*exp(-Q*aux1);
end;
end;
%Set of random measurements
Np=1000; %number of random points
px=zeros(1,Np);
py=zeros(1,Np);
nr=sigr*randn(1,Np);
na=siga*randn(1,Np);
for nn=1:Np,
r=r0+nr(nn);
a=alpha0+na(nn);
px(nn)=r*cos(a);
py(nn)=r*sin(a);
end;
xmean=sum(px/Np);
ymean=sum(py/Np);
figure(3)
plot(px,py,'g.'); hold on; %the points
contour(x1,x2,yp'); %the UT PDF approximation
plot(ya(1),ya(2),'b+', 'MarkerSize',12); %the PDF center
plot(xmean,ymean,'kx', 'MarkerSize',12); %the data mean
title('Some propagated data points, and the PDF approximation by UT');
xlabel('m'); ylabel('m'); axis([-700 700 0 800]);
The example of the falling body is used to illustrate the UKF algorithm (Figs. B.23,
B.24, B.25, B.26 and B.27).
Fig. B.23 System states (cross marks), and states estimated by the UKF (continuous) (Fig. 1.42)
(Fig. B.24: the altitude and velocity estimation errors, along seconds)
(Fig. B.25: the distance measurement and the drag, along seconds)
(Fig. B.26: evolution of the covariances P11, P12, P21 and P22, along sampling periods)
Fig. B.27 Evolution of the altitude and velocity Kalman gains (Fig. 1.45)
d=(rho*(xs(2,m)^2))/(2*xs(3,m)); %drag
xas(1,m)=xs(1,m)+(xs(2,m)*T);
xas(2,m)=xs(2,m)+((g+d)*T);
xas(3,m)=xs(3,m);
end;
%a priori state mean (a weighted sum)
xa=0;
for m=1:6,
xa=xa+(xas(:,m));
end;
xa=xa/(2*lN);
xa=xa+(LaN*xas(:,7));
%a priori cov.
aux=zeros(3,3); aux1=zeros(3,3);
for m=1:6,
aux=aux+((xas(:,m)-xa(:))*(xas(:,m)-xa(:))');
end;
aux=aux/(2*lN);
aux1=((xas(:,7)-xa(:))*(xas(:,7)-xa(:))');
aux=aux+(aaN*aux1);
M(:,:,nn+1)=aux+Sw;
%Update
%propagation of sigma points (measurement)
for m=1:7,
yas(m)=sqrt(L2+(xas(1,m)^2));
end;
%measurement mean
ya=0;
for m=1:6,
ya=ya+yas(m);
end;
ya=ya/(2*lN);
ya=ya+(LaN*yas(7));
%measurement cov.
aux2=0;
for m=1:6,
aux2=aux2+((yas(m)-ya)^2);
end;
aux2=aux2/(2*lN);
aux2=aux2+(aaN*((yas(7)-ya)^2));
Syy=aux2+Sv;
%cross cov
aux2=0;
for m=1:6,
aux2=aux2+((xas(:,m)-xa(:))*(yas(m)-ya));
end;
aux2=aux2/(2*lN);
aux2=aux2+(aaN*((xas(:,7)-xa(:))*(yas(7)-ya)));
Sxy=aux2;
Using a particle filter for the falling body example (Figs. B.28, B.29 and B.30).
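The cycle implemented in the program below (propagate each particle through the model, weight it by the measurement likelihood, resample, and take the particle mean) can be sketched for a scalar state; the dynamics and noise levels are illustrative assumptions:

```python
import math, random

random.seed(0)
Np = 500
px = [random.gauss(0.0, 1.0) for _ in range(Np)]     # initial particles
truth = 0.0
for _ in range(30):
    # hidden state and noisy measurement (assumed scalar model)
    truth = 0.9 * truth + 1.0 + random.gauss(0.0, 0.1)
    ym = truth + random.gauss(0.0, 0.2)
    # particle propagation through the same model
    px = [0.9 * p + 1.0 + random.gauss(0.0, 0.1) for p in px]
    # likelihood weighting, then multinomial resampling
    w = [math.exp(-0.5 * ((ym - p) / 0.2) ** 2) for p in px]
    px = random.choices(px, weights=w, k=Np)
xe = sum(px) / Np          # estimated state: the particle mean
```

The program below adds roughening after resampling, to keep particle diversity.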
Fig. B.28 System states (cross marks), and states estimated by the particle filter (continuous)
(Fig. 1.46)
(Fig. B.29: the altitude and velocity estimation errors, along seconds)
(Fig. B.30: the distance measurement and the drag, along seconds)
while nn<Nf+1,
%estimation recording
rxe(:,nn)=xe; %state
rer(:,nn)=x-xe; %error
%Simulation of the system
%system
rx(:,nn)=x; %state recording
rho=rho0*exp(-x(1)/k); %air density
d=(rho*(x(2)^2))/(2*x(3)); %drag
rd(nn)=d; %drag recording
%next system state
x(1)=x(1)+(x(2)*T);
x(2)=x(2)+((g+d)*T);
x(3)=x(3);
x=x+(w.*wx(:,nn)); %additive noise
%system output
y=sqrt(L2+(x(1)^2))+(v11*wy(nn)); %additive noise
ym=y; %measurement
ry(nn)=ym; %measurement recording
%Particle propagation
wp=randn(3,Np); %noise (process)
vm=randn(1,Np); %noise (measurement)
for ip=1:Np,
rho=rho0*exp(-px(1,ip)/k); %air density
d=(rho*(px(2,ip)^2))/(2*px(3,ip)); %drag
%next state
apx(1,ip)=px(1,ip)+(px(2,ip)*T);
apx(2,ip)=px(2,ip)+((g+d)*T);
apx(3,ip)=px(3,ip);
apx(:,ip)=apx(:,ip)+(w.*wp(:,ip)); %additive noise
%measurement (for next state)
ny(ip)=sqrt(L2+(apx(1,ip)^2))+(v11*vm(ip)); %additive noise
vy(ip)=ym-ny(ip);
end;
%Likelihood
%(vectorized part)
%scaling
vs=max(abs(vy))/4;
ip=1:Np;
pq(ip)=exp(-((vy(ip)/vs).^2));
spq=sum(pq);
%normalization
pq(ip)=pq(ip)/spq;
%Prepare for roughening
A=(max(apx')-min(apx'))';
sig=0.2*A*Np^(-1/3);
rn=randn(3,Np); %random numbers
%===========================================================
%Resampling (systematic)
acq=cumsum(pq);
cmb=linspace(0,1-(1/Np),Np)+(rand(1)/Np); %the "comb"
cmb(Np+1)=1;
ip=1; mm=1;
while(ip<=Np),
if (cmb(ip)<acq(mm)),
aux=apx(:,mm);
px(:,ip)=aux+(sig.*rn(:,ip)); %roughening
ip=ip+1;
else
mm=mm+1;
end;
end;
%===========================================================
%Results
%estimated state (the particle mean)
xe=sum(px,2)/Np;
nn=nn+1;
end;
%------------------------------------------------------
%display
figure(1)
subplot(1,2,1)
plot(tim,rx(1,1:Nf),'kx'); hold on;
plot(tim,rxe(1,1:Nf),'r');
title('altitude'); xlabel('seconds')
axis([0 Nf*T 0 12*10^4]);
subplot(1,2,2)
plot(tim,rx(2,1:Nf),'kx'); hold on;
plot(tim,rxe(2,1:Nf),'r');
title('velocity'); xlabel('seconds');
axis([0 Nf*T -6000 1000]);
figure(2)
subplot(1,2,1)
plot(tim,rer(1,1:Nf),'k');
title('altitude estimation error');
xlabel('seconds');
axis([0 Nf*T -3000 3000]);
subplot(1,2,2)
plot(tim,rer(2,1:Nf),'k');
title('velocity estimation error');
xlabel('seconds');
axis([0 Nf*T -500 500]);
figure(3)
subplot(1,2,1)
plot(tim,ry(1:Nf),'k');
title('distance measurement');
xlabel('seconds');
axis([0 Nf*T 0 12*10^4]);
subplot(1,2,2)
plot(tim,rd(1:Nf),'k');
title('drag');
xlabel('seconds');
axis([0 Nf*T 0 1000]);
(Figure: histograms of the prior particles and of the (multinomial) resampled particles)
(Figure: histograms after systematic, stratified, multinomial and residual resampling)
nn=1;
while nn<Nf+1,
%Simulation of the system
%system
rho=rho0*exp(-x(1)/k); %air density
d=(rho*(x(2)^2))/(2*x(3)); %drag
%next system state
x(1)=x(1)+(x(2)*T);
x(2)=x(2)+((g+d)*T);
x(3)=x(3);
x=x+(w.*wx(:,nn)); %additive noise
%system output
y=sqrt(L2+(x(1)^2))+(v11*wy(nn)); %additive noise
ym=y; %measurement
%Particle propagation
wp=randn(3,Np); %noise (process)
vm=randn(1,Np); %noise (measurement)
for ip=1:Np,
rho=rho0*exp(-px(1,ip)/k); %air density
d=(rho*(px(2,ip)^2))/(2*px(3,ip)); %drag
%next state
apx(1,ip)=px(1,ip)+(px(2,ip)*T);
apx(2,ip)=px(2,ip)+((g+d)*T);
apx(3,ip)=px(3,ip);
apx(:,ip)=apx(:,ip)+(w.*wp(:,ip)); %additive noise
%measurement (for next state)
ny(ip)=sqrt(L2+(apx(1,ip)^2))+(v11*vm(ip)); %additive noise
vy(ip)=ym-ny(ip);
end;
%Likelihood
%(vectorized part)
%scaling
vs=max(abs(vy))/4;
ip=1:Np;
pq(ip)=exp(-((vy(ip)/vs).^2));
spq=sum(pq);
%normalization
pq(ip)=pq(ip)/spq;
%Prepare for roughening
A=(max(apx')-min(apx'))';
sig=0.2*A*Np^(-1/3);
rn=randn(3,Np); %random numbers
%===========================================================
%Resampling (systematic)
acq=cumsum(pq);
cmb=linspace(0,1-(1/Np),Np)+(rand(1)/Np); %the "comb"
cmb(Np+1)=1;
ip=1; mm=1;
while(ip<=Np),
if (cmb(ip)<acq(mm)),
aux=apx(:,mm);
Spx(:,ip)=aux+(sig.*rn(:,ip)); %roughening
ip=ip+1;
else
mm=mm+1;
end;
end;
%===========================================================
%Resampling (multinomial)
acq=cumsum(pq);
mm=1;
nr=sort(rand(1,Np)); %ordered random numbers (0, 1]
for ip=1:Np,
while(acq(mm)<nr(ip)),
mm=mm+1;
end;
aux=apx(:,mm);
Mpx(:,ip)=aux+(sig.*rn(:,ip)); %roughening
end;
%===========================================================
%Resampling (residual)
acq=cumsum(pq);
mm=1;
%preparation
na=floor(Np*pq); %repetition counts
NR=sum(na); %total count
Npr=Np-NR; %number of non-repeated particles
rpq=((Np*pq)-na)/Npr; %modified weights
acq=cumsum(rpq); %for the monomial part
%deterministic part
mm=1;
for j=1:Np,
for cc=1:na(j), %do not reuse nn: it is the outer loop counter
Rpx(:,mm)=apx(:,j);
mm=mm+1;
end;
end;
%multinomial part:
mm=1; %restart the pointer into acq for the multinomial draws
nr=sort(rand(1,Npr)); %ordered random numbers (0, 1]
for j=1:Npr,
while(acq(mm)<nr(j)),
mm=mm+1;
end;
aux=apx(:,mm);
Rpx(:,NR+j)=aux+(sig.*rn(:,j)); %roughening
end;
%===========================================================
%Resampling (stratified)
acq=cumsum(pq);
stf=zeros(1,Np);
nr=rand(1,Np)/Np;
j=1:Np;
stf(j)=nr(j)+((j-1)/Np); %(vectorized code)
stf(Np+1)=1;
ip=1; mm=1;
while(ip<=Np),
if (stf(ip)<acq(mm)),
aux=apx(:,mm);
Fpx(:,ip)=aux+(sig.*rn(:,ip)); %roughening
ip=ip+1;
else
mm=mm+1;
end;
end;
%===========================================================
px=Spx; %posterior (edit to select a resampling method)
%Results
%estimated state (the particle mean)
xe=sum(px,2)/Np;
nn=nn+1;
end;
%------------------------------------------------------
%display
figure(1)
hist(pq,20);
title('histogram of weights');
figure(2)
plot(acq);
title('cumsum() of weights');
figure(3)
subplot(2,1,1)
bx=9.4e4:1.5e2:9.8e4;
hist(apx(1,:),bx);
title('histogram of prior particles')
subplot(2,1,2)
hist(Mpx(1,:),bx);
title('histogram of (multinomial) resampled particles');
figure(4)
Spt=hist(Spx(1,:),bx);
Mpt=hist(Mpx(1,:),bx);
Rpt=hist(Rpx(1,:),bx);
Fpt=hist(Fpx(1,:),bx);
subplot(2,2,1)
stem(bx,Spt,'k');
title('Systematic resampling')
subplot(2,2,2)
stem(bx,Fpt,'k');
title('Stratified resampling')
subplot(2,2,3)
stem(bx,Mpt,'k');
title('Multinomial resampling');
subplot(2,2,4)
stem(bx,Rpt,'k');
title('Residual resampling')
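All four resampling schemes map points of [0, 1) through the cumulative sum of the weights. Systematic resampling, the "comb" of the program above, can be sketched as follows (a Python sketch of the same pointer logic):

```python
import random

def systematic_resample(weights, rng=random):
    # normalize and accumulate the weights
    Np = len(weights)
    total = sum(weights)
    acq, c = [], 0.0
    for w in weights:
        c += w / total
        acq.append(c)
    # one random offset, then Np evenly spaced pointers (the "comb")
    u0 = rng.random() / Np
    idx, mm = [], 0
    for i in range(Np):
        u = u0 + i / Np
        while acq[mm] < u:
            mm += 1
        idx.append(mm)        # index of the selected particle
    return idx

random.seed(0)
idx = systematic_resample([0.1, 0.1, 0.7, 0.1])
```

A particle of weight w is copied either floor(Np*w) or ceil(Np*w) times, which is why systematic resampling has low variance.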
The next figure compares the results of Laplace's method and the variational method
for the approximation of the Student's t PDF (Fig. B.35).
Fig. B.35 Approximations of the Student's t PDF: (top) Laplace's method, (bottom) KLD minimization (Fig. 1.56)
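Laplace's method fits a Gaussian at the mode of the target density, matching the curvature of the log-density there. For the Student's t density with nu degrees of freedom this yields variance nu/(nu+1), which can be checked numerically (a sketch, not the book's program):

```python
import math

nu = 5.0
def logp(t):
    # log of the Student's t density, up to an additive constant
    return -(nu + 1.0) / 2.0 * math.log(1.0 + t * t / nu)

# curvature of the log-density at the mode t = 0, by central differences
h = 1e-4
d2 = (logp(h) - 2.0 * logp(0.0) + logp(-h)) / (h * h)
sigma2 = -1.0 / d2     # Laplace variance: should equal nu/(nu+1)
```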
sn(1,:)=sqrt(Sw(1,1))*randn(1,Nf);
sn(2,:)=sqrt(Sw(2,2))*randn(1,Nf);
%observation noise
Sv=[6e-4 0; 0 15e-4]; %cov.
on=zeros(2,Nf);
on(1,:)=sqrt(Sv(1,1))*randn(1,Nf);
on(2,:)=sqrt(Sv(2,2))*randn(1,Nf);
% system simulation preparation
x=[1;0]; % state vector with initial tank levels
u=0.4; %constant input
% Kalman filter simulation preparation
%space for matrices
K=zeros(2,2); M=zeros(2,2); P=zeros(2,2);
xe=[0.5; 0.2]; % filter state vector with initial values
%space for recording xe(n),ym(n)
rxe=zeros(2,Nf-1);
rym=zeros(2,Nf);
rym(:,1)=C*x; %initial value
%behaviour of the system and the Kalman
% filter after initial state
% with constant input u
for nn=1:Nf-1,
%system simulation
xn=(A*x)+(B*u)+sn(:,nn); %next system state
x=xn; %system state actualization
ym=(C*x)+on(:,nn); %output measurement
%Prediction
xa=(A*xe)+(B*u); %a priori state
M=(A*P*A')+ Sw;
%Update
K=(M*C')*inv((C*M*C')+Sv);
P=M-(K*C*M);
xe=xa+(K*(ym-(C*xa))); %estimated (a posteriori) state
%recording xe(n),ym(n)
rxe(:,nn)=xe;
rym(:,nn+1)=ym;
end;
%Smoothing-----------------------------
% Smoothing preparation
N=zeros(2,2); P=zeros(2,2);
% augmented state vectors
axa=zeros(2*(L+1),1);
axp=zeros(2*(L+1),1);
% augmented input
bu=zeros(2*(L+1),1); bu(1:2,1)=B*u;
% augmented A matrix
aA=diag(ones(2*L,1),-2); aA(1:2,1:2)=A;
% augmented K
aK=zeros(2*(L+1),2);
% set of covariances
Pj=zeros(2,2,L);
%space for recording xs(n)
rxs=zeros(2,Nf-1);
jed=(2*L)+1; %pointer for last entries
%jed=1;
%action:
axa(1:2,1)=rxe(:,1); %initial values
for nn=1:Nf,
M=(A*P*A')+Sw;
N=(C*P*C')+Sv;
ivN=inv(N);
K=(A*P*C')*ivN;
aK(1:2,:)=K;
aK(3:4,:)=(P*C')*ivN;
for jj=1:L,
bg=1+(jj*2); ed=bg+1;
aK(bg:ed,:)=(Pj(:,:,jj)*C')*ivN;
end;
aux=[A-K*C]';
Pj(:,:,1)=P*aux;
for jj=1:L-1,
Pj(:,:,jj+1)=Pj(:,:,jj)*aux;
end;
axp=(aA*axa)+bu+aK*(rym(:,nn)-C*axa(1:2,1));
P=M-(K*N*K');
rxs(:,nn)=axp(jed:jed+1);
axa=axp; %actualization (implies shifting)
end;
%-----------------------------------------------------
% display of state evolution
figure(3)
plot(rxs(1,L:end),'r'); %plots xs1
hold on;
plot(rxs(2,L:end),'b'); %plots xs2
plot(rxe(1,:),'mx'); %plots xe1
plot(rxe(2,:),'kx'); %plots xe2
axis([0 Nf 0 1]);
xlabel('sampling periods');
title('Kalman filter states(x) and Smoothed states(-)');
Example of smoothing of state estimate at a fixed point (Figs. B.37 and B.38).
(Fig. B.37: the smoothed state estimates, along sampling periods)
Fig. B.38 Evolution of covariances in the fixed-point smoothing example (Fig. 1.61)
end;
%Smoothing-----------------------------
% Smoothing preparation
Nfix=10; %the fixed point
%space for matrices
N=zeros(2,2); P11=zeros(2,2); P21=zeros(2,2);
% augmented state vectors
axa=zeros(4,1);
axp=zeros(4,1);
% augmented input
bu=zeros(4,1); bu(1:2,1)=B*u;
% augmented A matrix
aA=diag(ones(4,1)); aA(1:2,1:2)=A;
% augmented K
aK=zeros(4,2);
%space for recording xs(Nfix), P21(n)
rxs=zeros(2,Nf);
rP21=zeros(2,2,Nf); %this is the covariance actually recorded below
%action:
P11=rP(:,:,Nfix); P21=P11; %initial values
axa(1:2,1)=rxe(:,Nfix); %initial values
axa(3:4,1)=rxe(:,Nfix); %initial values
for nn=Nfix:Nf,
M=(A*P11*A')+Sw;
N=(C*P11*C')+Sv;
ivN=inv(N);
K=(A*P11*C')*ivN;
Ka=(P21*C')*ivN;
aK(1:2,:)=K; aK(3:4,:)=Ka;
axp=(aA*axa)+bu+aK*(rym(:,nn)-C*axa(1:2,1));
axa=axp; %actualization
rP21(:,:,nn)=P21; %recording
rxs(:,nn)=axp(3:4,1);
P11=M-(K*N*K');
P21=P21*(A-(K*C))';
end;
%----------------------------------------------------
% display of smoothed state at Nfix
figure(3)
plot(rxs(1,Nfix:end),'r'); %plots xs1
hold on;
plot(rxs(2,Nfix:end),'b'); %plots xs2
axis([0 Nf 0 0.6]);
xlabel('sampling periods');
title('State smoothing at Nfix');
% display of Covariance evolution
figure(4)
subplot(2,2,1)
plot(squeeze(rP21(1,1,Nfix:end)),'k');
title('Evolution of covariance');
subplot(2,2,2)
plot(squeeze(rP21(1,2,Nfix:end)),'k');
subplot(2,2,3)
plot(squeeze(rP21(2,1,Nfix:end)),'k');
subplot(2,2,4)
plot(squeeze(rP21(2,2,Nfix:end)),'k');
%Apply KSVD-------------------------------------------------
disp('apply K-SVD...please wait')
for kk=1:4, %4 KSVD iterations (you can add more)
CFM=spc(Di,PAM,etg); %sparse coding
%search for better dictionary elements (bDE)
pr=randperm(K);
for in=pr,
%data indices that use jth dictionary elements
rDI=find(CFM(in,:)); %relevant data indices
if(length(rDI)<1),
%when there are no such data indices
aux=PAM-Di*CFM; aux2=sum(aux.^2);
[d,i]=max(aux2);
bDE=PAM(:,i);
bDE=bDE./(sqrt(bDE'*bDE));
bDE=bDE.*sign(bDE(1));
CFM(in,:)=0;
else
%when there are such data indices
Maux=CFM(:,rDI);
Maux(in,:)=0; %elements to be improved
ers=PAM(:,rDI)-Di*Maux; %vector of errors to minimize
[bDE,SV,bV]=svds(ers,1); %the SVD
CFM(in,rDI)=SV*bV'; %use sign of first element
end;
Di(:,in)=bDE; %insert better dictionary element;
end;
disp(['iteration: ',num2str(kk)]);
end;
%clean dictionary
B1=3; B2=0.99;
%removing of identical atoms
er=sum((PAM-Di*CFM).^2,1);
Maux=Di'*Di; Maux=Maux-diag(diag(Maux));
for i=1:K,
aux=length(find(abs(CFM(i,:))>1e-7));
if (max(Maux(i,:))>B2) || (aux<=B1),
[v,ps]=max(er); er(ps(1))=0;
Di(:,i)=PAM(:,ps(1))/norm(PAM(:,ps(1)));
Maux=Di'*Di;Maux=Maux-diag(diag(Maux));
end;
end;
disp('dictionary ready')
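The atom update above takes the best rank-1 approximation of the residual matrix (svds(ers,1)). The same singular triplet can be obtained with a plain power iteration, sketched here in Python for a small matrix; this illustrates the linear algebra only, not the MATLAB routine:

```python
def rank1(E, iters=100):
    # power iteration for the leading singular triplet of E (a list of rows):
    # E is approximated by s * u * v'
    m, n = len(E), len(E[0])
    v = [1.0 / n ** 0.5] * n
    u, s = [0.0] * m, 0.0
    for _ in range(iters):
        u = [sum(E[i][j] * v[j] for j in range(n)) for i in range(m)]
        un = sum(x * x for x in u) ** 0.5
        u = [x / un for x in u]
        v = [sum(E[i][j] * u[i] for i in range(m)) for j in range(n)]
        s = sum(x * x for x in v) ** 0.5   # the leading singular value
        v = [x / s for x in v]
    return u, s, v

# a rank-1 matrix is recovered exactly: E = [1,2]' * [3,4]
E = [[3.0, 4.0], [6.0, 8.0]]
u, s, v = rank1(E)
```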
%Image denoising---------------------------------------------
Fout=zeros(fc,fr); %prepare space for denoised image
wgt=zeros(fc,fr); %weights
bks=im2col(Fn,[8,8],'sliding');
nub=size(bks,2); %number of blocks
ix=[1:nub];
disp('compute coefficients for denoising')
%Proceed with sets of 25000 coefficients
for nj=1:25000:nub,
jsz=min(nj+25000-1,nub); %jump size
cf= spc(Di,bks(:,nj:jsz),etg); %coefficients (sparse coding)
bks(:,nj:jsz)=Di*cf;
disp(['subset: ',num2str(nj),'-',num2str(jsz)]);
end;
disp('start denoising');
nn=1;
[r,c]=ind2sub([fc,fr]-7,ix);
for j=1:length(c),
ic=c(j); ir=r(j);
bk=reshape(bks(:,nn),[8,8]); %a block
Fout(ir:ir+7,ic:ic+7)=Fout(ir:ir+7,ic:ic+7)+bk;
wgt(ir:ir+7,ic:ic+7)=wgt(ir:ir+7,ic:ic+7)+ones(8);
nn=nn+1;
end;
%combine with noisy image
Fd = (Fn+0.034*sig*Fout)./(1+0.034*sig*wgt);
%Result-------------------------------------------------------
disp('result display');
figure(1)
subplot(1,2,1)
imshow(F0,[]);
title('original picture')
subplot(1,2,2)
imshow(Fn,[]);
title('noisy image')
figure(2)
imshow(Fd,[])
title('denoised image')
figure(3)
cD=zeros(9,9,1,K); %collection of Dictionary patches
for np=1:K,
ni=1;
for j=1:8,
for i=1:8,
cD(i,j,1,np)=Di(ni,np);
ni=ni+1;
end;
cD(9,j,1,np)=1; %white line
end;
%patch contrast augmentation
xx=1:8;
cD(xx,xx,1,np)=cD(xx,xx,1,np)-min(min(cD(xx,xx,1,np)));
aux=max(max(cD(xx,xx,1,np)));
if aux>0,
cD(xx,xx,1,np)=cD(xx,xx,1,np)./aux;
end;
end;
montage(cD);
title('patch dictionary');
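The denoising loop accumulates each reconstructed block into Fout and counts in wgt how many blocks cover each pixel, dividing at the end. The same bookkeeping can be sketched in one dimension (the patch values are hypothetical):

```python
# aggregate overlapping length-3 patches over a length-6 signal
signal_len, psize = 6, 3
patches = [[1.0, 1.0, 1.0], [2.0, 2.0, 2.0],
           [3.0, 3.0, 3.0], [4.0, 4.0, 4.0]]   # one patch per sliding position
out = [0.0] * signal_len
wgt = [0.0] * signal_len
for start, p in enumerate(patches):            # sliding positions 0..3
    for k in range(psize):
        out[start + k] += p[k]                 # accumulate patch values
        wgt[start + k] += 1.0                  # count covering patches
rec = [o / w for o, w in zip(out, wgt)]        # per-pixel average
```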
j=1; s(i,j)=sqrt(abs((u(i+1,j)-u(i,j))+bx(i,j))^2...
+abs((u(i,j+1)-u(i,j))+by(i,j))^2);
end;
for j=1:lx1,
i=1; s(i,j)=sqrt(abs((u(i+1,j)-u(i,j))+bx(i,j))^2...
+abs((u(i,j+1)-u(i,j))+by(i,j))^2);
end;
for i=2:ly,
j=lx; s(i,j)=sqrt(abs((u(i,j)-u(i-1,j))+bx(i,j))^2...
+abs((u(i,j)-u(i,j-1))+by(i,j))^2);
end;
for j=2:lx,
i=ly; s(i,j)=sqrt(abs((u(i,j)-u(i-1,j))+bx(i,j))^2...
+abs((u(i,j)-u(i,j-1))+by(i,j))^2);
end;
%obtain the dx, dy -------------------
ls=lambda*s; as=ls+1;
for i=2:ly1,
for j=1:lx,
dx(i,j)=(ls(i,j).*((u(i+1,j)-u(i-1,j))/2+bx(i,j)))...
/as(i,j);
end;
end;
for j=1:lx,
i=1; dx(i,j)=(ls(i,j)*((u(i+1,j)-u(i,j))+bx(i,j)))/as(i,j);
end;
for j=1:lx,
i=ly; dx(i,j)=(ls(i,j)*((u(i,j)-u(i-1,j))+bx(i,j)))...
/as(i,j);
end;
for i=1:ly,
for j=2:lx1,
dy(i,j)=(ls(i,j).*((u(i,j+1)-u(i,j-1))/2+by(i,j)))...
/as(i,j);
end;
end;
for i=1:ly,
j=1; dy(i,j)=(ls(i,j)*((u(i,j+1)-u(i,j))+by(i,j)))/as(i,j);
end;
for i=1:ly,
j=lx; dy(i,j)=(ls(i,j)*((u(i,j)-u(i,j-1))+by(i,j)))...
/as(i,j);
end;
%obtain the bx, by -------------------
for i=2:ly1,
j=1:lx;
bx(i,j)=bx(i,j)+((u(i+1,j)-u(i-1,j))/2-dx(i,j));
end;
for j=1:lx,
i=1; bx(i,j)=bx(i,j)+((u(i+1,j)-u(i,j))-dx(i,j));
end;
for j=1:lx,
i=ly; bx(i,j)=bx(i,j)+((u(i,j)-u(i-1,j))-dx(i,j));
end;
for i=1:ly,
j=2:lx1;
by(i,j)=by(i,j)+((u(i,j+1)-u(i,j-1))/2-dy(i,j));
end;
for i=1:ly,
j=1; by(i,j)=by(i,j)+((u(i,j+1)-u(i,j))-dy(i,j));
end;
for i=1:ly,
j=lx; by(i,j)=by(i,j)+((u(i,j)-u(i,j-1))-dy(i,j));
end;
end;
%display
figure(1)
subplot(1,2,1)
imshow(A);
title('ROF-TV denoising using Split Bregman');
xlabel('original');
subplot(1,2,2)
imshow(uint8(un));
xlabel('denoised')
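Split Bregman iterations are usually described with a shrinkage (soft-thresholding) step for the auxiliary gradient variables; the program above uses the smoothed form ls/(ls+1), while the classical operator is (a sketch; the function name shrink is an assumption):

```python
def shrink(z, t):
    # soft-thresholding: move z toward 0 by t, clamping at 0
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0
```

For instance, shrink(3.0, 1.0) gives 2.0, while any value in [-1, 1] is set to zero.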