Chapter 8
8.1 Introduction
Given two probability measures, say p and m, on a common probability space, how different or
distant from each other are they? Similarly, given two random processes with distributions p and m,
how distant are the processes from each other and what impact does such a distance have on their
respective ergodic properties? The goal of this final chapter is to develop two quite distinct notions
of the distance d(p, m) between measures or processes and to use these ideas to delve further into
the ergodic properties of processes and the ergodic decomposition. One metric, the distributional
distance, measures how well the probabilities of certain important events match up for the two
probability measures, and hence this metric need not have any relation to any underlying metric on
the original sample space. In other words, the metric makes sense even when we are not putting
probability measures on metric spaces. The second metric, the ρ̄-distance (rho-bar distance), depends
very strongly on a metric on the output space of the process and measures distance not by how
different probabilities are, but by how well one process can be made to approximate another. The
second metric is primarily useful in applications in information theory and statistics. Although
these applications are beyond the scope of this book, the metric is presented both for comparison
and because of the additional insight into ergodic properties the metric provides.
Such process metrics are of preliminary interest because they permit the quantitative assessment
of how different two random processes are and how their ergodic properties differ. Perhaps more
importantly, however, putting a metric on a space of random processes provides a topology for
the space and hence a collection of Borel sets and a measurable space to which we can assign
probability measures. This assignment of a probability measure to a space of processes provides a
general definition of a mixture process and provides a means of delving more deeply into the ergodic
decomposition of stationary measures developed in the previous chapter.
We have seen from the ergodic decomposition that all AMS measures have a stationary mean
that can be considered as a mixture of stationary and ergodic components. Thus, for example, any
stationary measure m can be considered as a mixture of the form
$$ m = \int p_x \, dm(x), \tag{8.1} $$

in the sense that

$$ m(F) = \int p_x(F) \, dm(x), \quad \text{all } F \in \mathcal{B}. \tag{8.2} $$
Thus stationary measures can be considered to be mixtures of stationary and ergodic measures; that
is, a stationary and ergodic measure is randomly selected from some collection and then used. In
other words, we effectively have a probability measure on the space of all stationary and ergodic
probability measures and the first measure is used to select one of the measures of the given class.
Alternatively, we have a probability measure on the space of all probability measures on the given
measurable space, but this super probability measure assigns probability one to the collection of
ergodic and stationary measures. The preceding relations show that both probabilities and expectations of measurements over the resulting mixture can be computed as integrals of the probabilities
and expectations over the component measures.
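To make the mixture construction concrete, here is a minimal numerical sketch (not from the text: the two Bernoulli components, the uniform mixing weights, and the event F are hypothetical choices). An ergodic component is selected at random and then used to generate outputs, and the mixture probability of an event is computed both as the average of the component probabilities, as in (8.2), and directly by simulation.

```python
import random

# Hypothetical ergodic components: i.i.d. Bernoulli(q) processes.
# Each component is stationary and ergodic; the mixture is stationary
# but, when the components differ, not ergodic.
components = [0.2, 0.7]   # Bernoulli parameters q_x for each component
weights = [0.5, 0.5]      # mixing measure over the components

def sample_mixture_output():
    """Select an ergodic component at random, then draw from it."""
    q = random.choices(components, weights=weights)[0]
    return 1 if random.random() < q else 0

# Event F = {x : x_0 = 1}.  By (8.2), m(F) is the weighted average of
# the component probabilities p_x(F).
m_F_formula = sum(w * q for w, q in zip(weights, components))

# Monte Carlo estimate of m(F) from the mixture itself.
trials = 100_000
m_F_mc = sum(sample_mixture_output() for _ in range(trials)) / trials

print(f"m(F) via (8.2): {m_F_formula:.3f}; via simulation: {m_F_mc:.3f}")
```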
In applications of the theory of random processes such as information and communication theory,
one is often concerned not only with ordinary functions or measurements, but also with functionals
of probability measures, that is, mappings of the space of probability measures into the real line.
Examples include the entropy rate of a dynamical system, the distortion-rate function of a random
process with respect to a fidelity criterion, the information capacity of a noisy channel, the rate
required to code a given process with nearly zero reproduction error, the rate required to code a
given process for transmission with a specified fidelity, and the maximum rate of coded information
that can be transmitted over a noisy channel with nearly zero error probability. (See, for example,
the papers and commentary in Gray and Davisson [25].) All of these quantities are functionals of
measures: a given process distribution or measure yields a real number as a value of the functional.
Given such a functional of measures, say D(m), it is natural to inquire under what conditions
the analog of (8.2) for functions might hold, that is, conditions under which
$$ D(m) = \int D(p_x) \, dm(x). \tag{8.3} $$
This is in general a much more difficult issue than before since, unlike an ordinary measurement f, the functional depends on the underlying probability measure. One goal of this chapter is to develop conditions under which (8.3) holds.
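For intuition, the simplest functionals satisfying (8.3) are the affine ones, such as D(m) = ∫ f dm for a fixed measurement f. Below is a minimal sketch on a hypothetical two-point space (the distributions, the weight lam, and f are invented for illustration): the functional of a mixture equals the mixture of the functionals. The chapter's concern is when such an interchange can be established for less trivial functionals.

```python
# Hypothetical finite example: D(m) = expectation of f under m is
# affine in m, so it satisfies the analog of (8.3) exactly.
f = {0: 1.0, 1: 3.0}            # a measurement on the space {0, 1}
p1 = {0: 0.8, 1: 0.2}           # "ergodic component" 1
p2 = {0: 0.3, 1: 0.7}           # "ergodic component" 2
lam = 0.4                       # mixing weight

def D(m):
    """Affine functional of the measure: the expectation of f."""
    return sum(m[w] * f[w] for w in m)

mix = {w: lam * p1[w] + (1 - lam) * p2[w] for w in p1}

# D(mixture) equals the mixture of the D values, the analog of (8.3).
assert abs(D(mix) - (lam * D(p1) + (1 - lam) * D(p2))) < 1e-12
print(D(mix))
```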
8.2 A Metric Space of Measures
We now focus on spaces of probability measures and on the structure of such spaces. We shall show
how the ergodic decomposition provides an example of a probability measure on such a space. In
Section 8.4 we shall study certain functionals defined on this space.
Let (Ω, B) be a measurable space such that B is countably generated (e.g., it is standard). We do not assume any metric structure for Ω. Define P((Ω, B)) as the class of all probability measures on (Ω, B). It is easy to see that P((Ω, B)) is a convex subset of the class of all finite measures on (Ω, B); that is, if m_1 and m_2 are in P((Ω, B)) and λ ∈ (0, 1), then λm_1 + (1 − λ)m_2 is also in P((Ω, B)).
We can put a variety of metrics on P((Ω, B)) and thereby make it a metric space. This will provide a notion of closeness of probability measures on the class. It will also provide a Borel field of subsets of P((Ω, B)) on which we can put measures and thereby construct mixtures of measures in the class.
In this section we consider a type of metric that is not very strong, but will suffice for developing
the ergodic decomposition of affine functionals. In the next section another metric will be considered,
but it is tied to the assumption of Ω being itself a metric space and hence applies less generally.
Given any class G = {F_i; i = 1, 2, . . .} consisting of a countable collection of events, we can define for any measures p, m ∈ P((Ω, B)) the function

$$ d_G(p, m) = \sum_{i=1}^{\infty} 2^{-i} \, |p(F_i) - m(F_i)|. \tag{8.4} $$
If G contains a generating field, then d_G is a metric on P((Ω, B)), since from Lemma 1.5.5 two measures defined on a common σ-field generated by a field are identical if and only if they agree on the generating field. We shall always assume that G contains such a field. We shall call such a distance a distributional distance. Usually we will require that G be a standard field and hence that the underlying measurable space (Ω, B) be standard. Occasionally, however, we will wish to use a larger class G and will not require that it be standard. Different results will require different assumptions on G. When the class G is understood from context, the subscript will be dropped.
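A small computational sketch of (8.4) may help fix the definition (the finite sample space, the event list, and the two measures below are hypothetical; in practice the infinite sum is truncated at n terms, which changes the value by at most 2^{-n} since |p(F_i) − m(F_i)| ≤ 1):

```python
def d_G(p, m, events):
    """Distributional distance (8.4), truncated to the listed events:
    sum over i of 2**-i * |p(F_i) - m(F_i)|."""
    total = 0.0
    for i, F in enumerate(events, start=1):
        pF = sum(p[w] for w in F)   # p(F_i) on a finite space
        mF = sum(m[w] for w in F)   # m(F_i)
        total += 2.0 ** (-i) * abs(pF - mF)
    return total

# Hypothetical example: sample space {0, 1, 2} and three events.
events = [{0}, {1}, {0, 1}]
p = {0: 0.5, 1: 0.3, 2: 0.2}
m = {0: 0.4, 1: 0.4, 2: 0.2}
print(d_G(p, m, events))   # 0.5*0.1 + 0.25*0.1 + 0.125*0.0 = 0.075
```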
Let (P((Ω, B)), d_G) denote the metric space of P((Ω, B)) with the metric d_G. A key property of
spaces of probability measures on a fixed measurable space is given in the following lemma.
Lemma 8.2.1 The metric space (P((Ω, B)), d_G) of all probability measures on a measurable space (Ω, B) with a countably generated σ-field is separable if G contains a countable generating field. If in addition G is standard (as is possible if the underlying measurable space (Ω, B) is standard), then (P((Ω, B)), d_G) is also complete and hence Polish.
Proof: For each n let A_n denote the set of nonempty intersection sets or atoms of {F_1, . . . , F_n}, the first n sets in G. (These are sets, events, in the original space.) For each set G ∈ A_n choose an arbitrary point x_G such that x_G ∈ G. We will show that the class of all measures of the form

$$ r(F) = \sum_{G \in A_n} p_G \, 1_F(x_G), $$

with the p_G nonnegative rational numbers summing to one, forms a dense set in P((Ω, B)). Since this class is countable, P((Ω, B)) is separable. Observe that we are approximating all measures by finite sums of point masses. Fix a measure m ∈ P((Ω, B)) and an ε > 0. Choose n so large that 2^{-n} < ε/2. Thus to match up two measures in d = d_G, (8.4) implies that we must match up the probabilities of the first n sets in G, since the contribution of the remaining terms is less than 2^{-n}. Define

$$ r_n(F) = \sum_{G \in A_n} m(G) \, 1_F(x_G) $$

and note that

$$ m(F) = \sum_{G \in A_n} m(G) \, m(F|G), $$

where m(F|G) = m(F ∩ G)/m(G) is the elementary conditional probability of F given G if m(G) > 0 and is arbitrary otherwise. For convenience we now consider the preceding sums to be confined to those G for which m(G) > 0.
Since the G are the atoms of the first n sets {F_i}, for any of these F_i either G ⊂ F_i, and hence G ∩ F_i = G, or G ∩ F_i = ∅. In the first case 1_{F_i}(x_G) = m(F_i|G) = 1, and in the second case 1_{F_i}(x_G) = m(F_i|G) = 0; hence in both cases

$$ r_n(F_i) = m(F_i), \quad i = 1, 2, \ldots, n. $$
This implies that

$$ d(r_n, m) \le \sum_{i=n+1}^{\infty} 2^{-i} = 2^{-n} < \frac{\epsilon}{2}. $$
Enumerate the atoms of A_n as {G_l; l = 1, 2, . . . , L}, where L ≤ 2^n. For all l but the last (l = L) pick a rational number p_{G_l} such that

$$ |p_{G_l} - m(G_l)| \le \frac{\epsilon}{4} 2^{-l}, $$

and set

$$ p_{G_L} = 1 - \sum_{l=1}^{L-1} p_{G_l}. $$

Then

$$ |p_{G_L} - m(G_L)| = \Big| \sum_{l=1}^{L-1} p_{G_l} - \sum_{l=1}^{L-1} m(G_l) \Big| \le \sum_{l=1}^{L-1} |p_{G_l} - m(G_l)| \le \frac{\epsilon}{4} \sum_{l=1}^{L-1} 2^{-l} \le \frac{\epsilon}{4}. $$
Thus

$$ \sum_{G \in A_n} |p_G - m(G)| = \sum_{l=1}^{L-1} |p_{G_l} - m(G_l)| + |p_{G_L} - m(G_L)| \le \frac{\epsilon}{4} + \frac{\epsilon}{4} = \frac{\epsilon}{2}. $$

Now define the measure r in the candidate dense class by

$$ r(F) = \sum_{G \in A_n} p_G \, 1_F(x_G). $$

Since r_n(F_i) = m(F_i) for i = 1, 2, . . . , n,

$$ d(r, m) \le \sum_{i=1}^{n} 2^{-i} |r(F_i) - r_n(F_i)| + 2^{-n} \le \sum_{i=1}^{n} 2^{-i} \Big| \sum_{G \in A_n} (p_G - m(G)) \, 1_{F_i}(x_G) \Big| + \frac{\epsilon}{2} \le \sum_{i=1}^{n} 2^{-i} \sum_{G \in A_n} |p_G - m(G)| + \frac{\epsilon}{2} \le \frac{\epsilon}{2} + \frac{\epsilon}{2} = \epsilon. $$

Since ε was arbitrary, the countable class of point-mass measures with rational weights is dense in P((Ω, B)), which proves separability.
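The approximation used in the proof is easy to imitate numerically. The sketch below (a hypothetical finite space, two generating events, and a measure m, all invented for illustration) computes the atoms of the first n events, builds the point-mass measure r_n of the proof, and checks that it matches m exactly on each F_i; replacing the masses m(G) by nearby rationals would then give a member of the countable dense class.

```python
from itertools import product

# Hypothetical setup: finite sample space and the first n = 2 events.
omega = [0, 1, 2, 3, 4, 5]
events = [{0, 1, 2}, {2, 3}]                      # F_1, F_2
m = {0: 0.1, 1: 0.1, 2: 0.2, 3: 0.2, 4: 0.2, 5: 0.2}

# Atoms of {F_1, ..., F_n}: nonempty intersections of each event or
# its complement, exactly as in the proof.
n = len(events)
atoms = []
for pattern in product([True, False], repeat=n):
    atom = {w for w in omega
            if all((w in F) == inc for F, inc in zip(events, pattern))}
    if atom:
        atoms.append(atom)

# r_n places mass m(G) on one representative point x_G of each atom G.
r_n = {w: 0.0 for w in omega}
for G in atoms:
    x_G = min(G)                 # an arbitrary representative of G
    r_n[x_G] += sum(m[w] for w in G)

# r_n agrees with m on every F_i, so d_G(r_n, m) <= 2**-n.
for F in events:
    assert abs(sum(r_n[w] for w in F) - sum(m[w] for w in F)) < 1e-12
print(r_n)
```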