Probabilistic & Stochastic Modeling (Part I) : Probability
The purpose of this and the next lecture is to introduce probabilistic reasoning and stochastic modeling. Historically, dynamical systems at the end of the 19th century led to a deterministic view of the physical world. The first half of the twentieth century witnessed the overthrow of determinism by the essentially probabilistic quantum theory. Of relevance to laboratory work is that variability in biological preparations and instrument noise, common features of any experiment, introduce a probabilistic element. To model experiment, and the data that arise from experiment, it is desirable to be able to characterize both trend and fluctuations in the presence of noise. As required, this should be accomplished dynamically, i.e., as phenomena evolve in time.
Probability
Classical probability is based on frequency of occurrence. For example, if in tossing a coin N times, N_h heads occur, then the probability p of tossing a head is
p = \lim_{N \to \infty} \frac{N_h}{N}.  (11.1)
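As a quick numerical illustration of (11.1), the relative frequency N_h/N of simulated tosses settles toward p as N grows. (A minimal Matlab sketch; the fair-coin value p = 0.5 and the variable names are our own choices.)

% Estimate p = P(head) from the relative frequency Nh/N of simulated tosses.
p_true = 0.5;                          % assumed fair coin
for N = [10 100 1000 100000]
    tosses = rand(1, N) < p_true;      % 1 = head, 0 = tail
    Nh = sum(tosses);
    fprintf('N = %6d   Nh/N = %.4f\n', N, Nh/N);
end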
For generality we need to consider tossing an unfair coin. A useful substitute model is an urn filled with a large number of marbles, identical except for color, which say are black or white. One can then blindly choose a marble, with replacement, as an equivalent to coin tossing. The odds only depend on the composition of the urn. Thus we can speak of p_b & p_w as the probabilities of black & white draws, and
p_b + p_w = 1.  (11.2)
The urn model can immediately be generalized to any number of different colored marbles, and therefore to many possible outcomes, with probabilities p_1, p_2, \ldots, p_k such that
p_1 + p_2 + \cdots + p_k = 1.  (11.3)
Modern probability deliberately avoids the above constructive approach in the hope of achieving clarity. It starts with a sample space \Omega composed of events E, and a probability function P defined on the events so that:
(i) For E \subseteq \Omega, 0 \le P(E) \le 1;
(ii) P(\Omega) = 1;
(iii) For mutually exclusive events E_1, E_2, E_3, \ldots,
P(E_1 \cup E_2 \cup E_3 \cup \cdots) = P(E_1) + P(E_2) + P(E_3) + \cdots.  (11.4)
\Omega is purely a set of events. For example, for the toss of 10 equal coins, or 10 tosses of one coin, there are 2^{10} events in \Omega. If we are interested in, say, the probability of 5 heads in the 10 tosses, knowing all the individual cases is not handy. To deal with this and other issues a function from \Omega to the real numbers, known as a random variable, is introduced. Thus in this example, if the random variable X is the number of heads, we write
p_k = P(X = k),  (11.5)
so that
p_j = P(X = j); \quad j = 0, 1, \ldots, 10,  (11.6)
with
\sum_{j=0}^{10} p_j = 1.  (11.7)
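A brute-force check of this bookkeeping (our own sketch, not part of the lecture): enumerate all 2^{10} equally likely outcomes and read off P(X = 5).

% P(X = 5) for 10 tosses of a fair coin, by enumerating all 2^10 outcomes.
outcomes = dec2bin(0:2^10-1) - '0';    % each row is one outcome, 1 = head
X = sum(outcomes, 2);                  % random variable: number of heads
p5 = mean(X == 5);                     % fraction of (equally likely) outcomes with X = 5
fprintf('P(X = 5) = %.4f\n', p5);      % 252/1024 = 0.2461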
Another example is throwing a pair of dice, in which case the interest is not in the 36 possible throws of \Omega but rather in the sum of the pair, and therefore in the random variable X of
p_n = P(X = n); \quad n = 2, 3, \ldots, 12.  (11.8)
Enumeration of the outcomes in the game of dice yields the following table.
      1   2   3   4   5   6
  1   2   3   4   5   6   7
  2   3   4   5   6   7   8
  3   4   5   6   7   8   9
  4   5   6   7   8   9  10
  5   6   7   8   9  10  11
  6   7   8   9  10  11  12

Table 11.1: Possible tosses of each die are shown in the outer column and upper row.
The inner square matrix indicates the sums.
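The p_n of (11.8) follow from Table 11.1 by counting; a short enumeration (our own sketch) reproduces them.

% p_n = P(X = n) for the sum X of two fair dice, by enumerating the 36 throws.
[d1, d2] = meshgrid(1:6, 1:6);
S = d1(:) + d2(:);                     % the 36 possible sums
for n = 2:12
    fprintf('p_%-2d = %d/36\n', n, sum(S == n));
end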
Coin Tossing
Each outcome of a coin flip is entirely independent of the prior toss, i.e., memory does not enter. (After a run of only heads, to say a tail is due is naive.) The particular run of k tails and n − k heads illustrated in Table 11.2 has probability
q^k p^{n-k}.  (11.9)

 cell:     1   2   ...   k   k+1   ...   n
 outcome:  T   T   ...   T   H     ...   H

Table 11.2: A particular run: tails occupy the first k cells and heads the remaining n − k.
To determine the probability of exactly k tails in n tosses, B_n(k), observe that in principle there are n! ways of rearranging all the n cells of Table 11.2, and that there are k! ways of permuting the first k cells and (n − k)! ways of permuting the last (n − k) cells, both of which leave the outcome unchanged. Thus the answer to the question "What is the probability of exactly k tails in n tosses?" is
B_n(k) = \frac{n!}{(n-k)!\,k!}\, p^{n-k} q^k,  (11.10)
which is known as the Bernoulli probability distribution. Solitary thought is usually needed to absorb this argument. An enumerated concrete case such as n = 3 below also can be of help.
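As a concrete check, for n = 3 the eight outcomes, grouped by the number of tails k, are
k = 0: HHH, with probability p^3 = B_3(0);
k = 1: THH, HTH, HHT, each with probability q p^2, so B_3(1) = 3 q p^2 = \frac{3!}{2!\,1!}\, p^2 q;
k = 2: TTH, THT, HTT, each with probability q^2 p, so B_3(2) = 3 q^2 p;
k = 3: TTT, with probability q^3 = B_3(3);
so that 3!/((3-k)!\,k!) simply counts the arrangements within each group.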
Of course we must have
\sum_{k=0}^{n} B_n(k) = 1,  (11.11)
and to see that this is true observe that the binomial theorem, Lecture 1, states
(p + q)^n = \sum_{k=0}^{n} \frac{n!}{(n-k)!\,k!}\, p^{n-k} q^k.  (11.12)
Since p + q = 1 the left-hand side equals 1, and (11.11) follows. In symmetric notation (11.12) reads
(p_1 + p_2)^n = \sum_{\substack{n_1+n_2=n \\ n_1,n_2 \ge 0}} \frac{n!}{n_1!\,n_2!}\, p_1^{n_1} p_2^{n_2},  (11.13)
where the summation is over n_1 & n_2 such that n_1 + n_2 = n. We can generalize this to an urn with l different marbles having probabilities p_1 + p_2 + \cdots + p_l = 1. The counterpart to (11.13) then is
1 = (p_1 + p_2 + \cdots + p_l)^N = \sum_{\substack{n_1+n_2+\cdots+n_l=N \\ n_k \ge 0}} \frac{N!}{n_1! \cdots n_l!}\, p_1^{n_1} p_2^{n_2} \cdots p_l^{n_l}.  (11.14)
The Bernoulli probability (11.10) is binomial, and its multinomial generalization, which follows from (11.14), is given by
P(X_1 = n_1, X_2 = n_2, \ldots, X_l = n_l) = \frac{N!}{n_1! \cdots n_l!}\, p_1^{n_1} p_2^{n_2} \cdots p_l^{n_l}, \qquad n_1 + \cdots + n_l = N.  (11.15)
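As an illustration of (11.15), consider an urn with three colors; the composition p = (0.5, 0.3, 0.2) and the counts below are our own choices for the sketch.

% Multinomial probability (11.15) for a 3-color urn and N = 5 draws with replacement.
p = [0.5 0.3 0.2];                     % assumed composition of the urn
n = [2 2 1];                           % desired counts n1, n2, n3
N = sum(n);
P = factorial(N) / prod(factorial(n)) * prod(p.^n);
fprintf('P = %.4f\n', P);              % 30 * 0.0045 = 0.1350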
These are two examples of common probability distributions. Both are discrete, but the extension to continuous distributions is straightforward.
In all cases a probability distribution (pdf) P(x) must respect the two formal conditions (i) & (ii), which in the present notation are:
P(x) \ge 0  (11.16)
and
\sum_x P(x) = 1,  (11.17)
where the last is deliberately ambiguous so as to cover both discrete and continuous (integration) cases.
An entire class of continuous pdfs is obtained free of effort from the gamma function (1.57). Clearly, from the definition of s!, the function
G_s(x) = \frac{x^s e^{-x}}{s!}  (11.18)
satisfies (11.16) & (11.17) for any s, and for obvious reasons is known as a gamma distribution. The range of applications of (11.18) can be extended by setting x = \lambda t, but to avoid a common error we first observe that
1 = \int_0^\infty G_s(x)\,dx \underset{x=\lambda t}{=} \int_0^\infty \frac{(\lambda t)^s e^{-\lambda t}}{s!}\,\lambda\, dt.  (11.19)
Therefore
G_s(t;\lambda) = \frac{\lambda(\lambda t)^s e^{-\lambda t}}{s!}  (11.20)
is the properly normalized pdf in t.
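A quick numerical check of (11.20) (a sketch; the values s = 3 and \lambda = 2 are arbitrary) confirms that the leading factor of \lambda is what keeps the distribution normalized in t.

% Verify that G_s(t;lambda) = lambda*(lambda*t)^s*exp(-lambda*t)/s! integrates to 1.
s = 3; lambda = 2;
G = @(t) lambda .* (lambda*t).^s .* exp(-lambda*t) ./ factorial(s);
fprintf('integral of G_s(t;lambda) over (0,Inf) = %.6f\n', integral(G, 0, Inf));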
More generally, under a change of variable x = x(y),
1 = \int_{\{x\}} P(x)\,dx = \int_{\{y\}} \frac{dx}{dy}\, P(x(y))\,dy,  (11.21)
so that
P(y) = \left|\frac{dx(y)}{dy}\right| P(x(y)).  (11.22)
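To see (11.21)-(11.22) in action, take x uniform on (0,1) and y = x^2 (our own choice for the sketch); the transformed pdf should then be 1/(2\sqrt{y}) on (0,1).

% Change-of-variable check: x ~ uniform(0,1), y = x^2, predicted pdf 1/(2*sqrt(y)).
x = rand(1, 1e6);
y = x.^2;
[counts, edges] = histcounts(y, 50, 'Normalization', 'pdf');
centers = (edges(1:end-1) + edges(2:end)) / 2;
plot(centers, counts, 'o', centers, 1./(2*sqrt(centers)), '-');
legend('histogram of y', '1/(2*sqrt(y))');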
Observe that if
E(x) = \left(1 - \frac{x}{n}\right)^n,  (11.23)
then
\ln\left(1 - \frac{x}{n}\right)^n = n \ln\left(1 - \frac{x}{n}\right) = -n\left(\frac{x}{n} + \frac{x^2}{2n^2} + \cdots\right),  (11.24)
and therefore
E(x) = e^{-x}\left(1 + O\!\left(\frac{1}{n}\right)\right).  (11.25)
Next consider n! itself for large n. From the definition of the gamma function (1.57),
n! = \int_0^\infty t^n e^{-t}\,dt = \int_0^\infty e^{-t + n\ln t}\,dt = \int_0^\infty e^{f(t,n)}\,dt.  (11.26)
Clearly the integrand has a maximum in the interval of integration. To locate this observe that
\frac{df}{dt} = \frac{n}{t} - 1  (11.27)
vanishes at t = n, and near this point
f = -n + n\ln n - \frac{1}{2n}(t-n)^2 + O\big((t-n)^3\big).  (11.28)
Substituting (11.28) into (11.26) and making the change of variable t = ns gives
n! \approx e^{-n} n^{n+1} \int_0^\infty e^{-\frac{n}{2}(s-1)^2}\, ds.  (11.29)
Since n is large the integrand of (11.29) has a sharp peak, of width O(n^{-1/2}), and the lower limit can reasonably be extended to -\infty. With this and the further variable change
(s - 1) = \frac{x}{\sqrt{n}}  (11.30)
we can write
n! \approx e^{-n} n^{n+1/2} \sqrt{2\pi} \int_{-\infty}^{\infty} \frac{e^{-x^2/2}}{\sqrt{2\pi}}\, dx.  (11.31)
Since
\int_{-\infty}^{\infty} \frac{e^{-x^2/2}}{\sqrt{2\pi}}\, dx = 1,  (11.32)
we obtain
n! \approx \sqrt{2\pi}\, n^{n+1/2} e^{-n},  (11.33)
which is Stirling's approximation.
Note also that
\frac{e^{-x^2/2}}{\sqrt{2\pi}}  (11.34)
is a pdf on -\infty < x < \infty, and is referred to as the normal distribution, or the Gaussian.^1

^1 An historical error, since de Moivre, a century before Gauss, fully demonstrated the central role of (11.34) in science.
Exercise 11.1 (a) Go through the steps leading to (11.33) in more careful terms.
(b) Note that nowhere was it assumed that n is an integer. Compare Stirling's form with n! for a continuous range of n \ge 1; use Matlab's gamma for this.
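A minimal sketch of the comparison suggested in (b) (the plot range and sample points are our own choices):

% Compare Stirling's approximation (11.33) with n! = gamma(n+1) over a continuous range of n.
n = linspace(1, 20, 200);
stirling = sqrt(2*pi) .* n.^(n + 0.5) .* exp(-n);
exact    = gamma(n + 1);
semilogy(n, exact, '-', n, stirling, '--');
xlabel('n'); ylabel('n!');
legend('gamma(n+1)', 'Stirling');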
Another result of interest is contained in the next exercise.
Exercise 11.2 (a) Consider the Bernoulli distribution B_N(n) in the limit of N large and p small, such that Np = \lambda t is fixed, to show
\lim_{N\to\infty} B_N(n) = \frac{(\lambda t)^n e^{-\lambda t}}{n!} = P_n(t).  (11.35)
This is called the Poisson distribution. Based on our derivation, P_n(t) is the probability of n events in a time t defined by a rate \lambda. Note the contrast with the gamma distribution (11.20), which is a pdf in t.
(b) (11.35) is a pdf in n and therefore we should have
\sum_{n=0}^{\infty} P_n(t) = 1.  (11.36)
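The limit in (a) can also be watched numerically (a sketch; \lambda t = 3 and the range of n are our own choices):

% Binomial -> Poisson: compare B_N(n) with (lam)^n exp(-lam)/n! as N grows, with N*p = lam fixed.
lam = 3; n = 0:10;
poisson = lam.^n .* exp(-lam) ./ factorial(n);
for N = [10 100 1000]
    p = lam / N;
    binom = exp(gammaln(N+1) - gammaln(n+1) - gammaln(N-n+1) ...
                + n*log(p) + (N-n)*log(1-p));
    fprintf('N = %4d   max |B_N(n) - P_n| = %.2e\n', N, max(abs(binom - poisson)));
end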
Expectation
For a random variable with pdf P(x), the expectation of a function f(x) is
E(f) = \langle f \rangle = \bar{f} = \sum_x f(x) P(x),  (11.37)
and when money is involved it gives the expected gain. The mean is
E(x) = \sum_x x P(x) = \langle x \rangle = \bar{x} = \mu.  (11.38)
Variance
Once a mean has been determined for P(x), the variance is defined by
\sigma^2 = \langle (x - \mu)^2 \rangle = \langle x^2 \rangle - \mu^2.  (11.39)
A related dimensionless measure of fluctuation is the coefficient of variation
c_v = \frac{\sigma}{\mu}.  (11.40)
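As a small worked example (our own), the mean, variance, and coefficient of variation of the two-dice sum of (11.8):

% Mean, variance, and cv of the sum of two fair dice, from its pmf (Table 11.1).
n  = 2:12;
pn = [1 2 3 4 5 6 5 4 3 2 1] / 36;
mu     = sum(n .* pn);                 % 7
sigma2 = sum(n.^2 .* pn) - mu^2;       % 35/6 = 5.833
cv     = sqrt(sigma2) / mu;            % about 0.345
fprintf('mu = %.3f   sigma^2 = %.3f   cv = %.3f\n', mu, sigma2, cv);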
A convenient device for generating the moments of a pdf is its characteristic function,
\phi(t) = \sum_x e^{ixt} P(x).  (11.41)
Clearly
\phi(0) = 1, \qquad \frac{d\phi}{dt}\Big|_{t=0} = i\langle x \rangle, \qquad \frac{d^2\phi}{dt^2}\Big|_{t=0} = -\langle x^2 \rangle,  (11.42)
which is enough to calculate mean and variance, and if we continue, higher moments of interest can be evaluated.
For example if P = B_n(k), then since e^{ikt} = (e^{it})^k,
\phi(t) = \sum_{k=0}^{n} \frac{n!}{(n-k)!\,k!}\, p^{n-k} q^k e^{ikt} = (p + q e^{it})^n.  (11.43)
In the continuous case the sum is replaced by an integral; for the Gaussian (11.34),
\phi(t) = \int_{-\infty}^{\infty} e^{ixt}\, \frac{e^{-x^2/2}}{\sqrt{2\pi}}\, dx.  (11.44)
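A direct numerical check of (11.43) (a sketch; the values of n, q, and t are arbitrary):

% Check phi(t) = sum_k B_n(k) exp(i*k*t) against the closed form (p + q*exp(i*t))^n.
n = 8; q = 0.3; p = 1 - q; t = 0.7;
k = 0:n;
Bn  = arrayfun(@(kk) nchoosek(n, kk), k) .* p.^(n-k) .* q.^k;
lhs = sum(Bn .* exp(1i*k*t));
rhs = (p + q*exp(1i*t))^n;
fprintf('|lhs - rhs| = %.2e\n', abs(lhs - rhs));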
Exercise 11.3 (a) Use (11.43) to obtain the mean and variance of B_n.
(b) Show that (11.44), which is the Fourier transform of the Gaussian, is
\phi(t) = e^{-t^2/2}.  (11.45)
For the fair coin (p = q = 1/2) consider the net gain after N tosses, counting +1 for each head and -1 for each tail; with k tails this is
G_N = (N - k) - k = N - 2k,  (11.46)
and from the mean and variance of B_N it follows that
\langle G_N \rangle = 0, \qquad \langle G_N^2 \rangle = \sigma^2 = N.  (11.47)
(c) Set up a Matlab program to perform 100 trials of 100 tosses of a fair coin, calculate B_n(k), and calculate \sigma by averaging over the 100 trials. Is
\frac{\sigma}{N} = \frac{1}{2\sqrt{N}}?  (11.48)
If x_1, x_2, \ldots, x_N are N independent samples drawn from P(x), the sample mean
\bar{x} = \frac{1}{N}\sum_{j=1}^{N} x_j  (11.49)
might be regarded as an approximation to \mu. In fact, on this basis the expectation of the right-hand side of (11.49) shows that
\langle \bar{x} \rangle = \mu.  (11.50)
Formally the expectation of (11.49) requires P(x_1, x_2, \ldots, x_N), but in the present circumstance this is clearly given by P(x_1)P(x_2)\cdots P(x_N). (11.49) is said to be an unbiased estimator of \mu, i.e., the expectation of the quantity is the quantity being estimated. We pause to mention that at this point we have passed the blurry boundary between Probability and Statistics.^2

^2 Statistics is a funny subject. The first time you go through it, you don't understand it at all. The second time you go through it, you think you understand it, except for one or two small points. The third time you go through it you know you don't understand it, but by that time you are so used to it, it doesn't bother you anymore. With a tip of the hat to Arnold Sommerfeld, who first said this of Thermodynamics.
A likely candidate for estimating \sigma^2 is
\hat{s}^2 = \frac{1}{N}\sum_{j=1}^{N} (x_j - \bar{x})^2,  (11.51)
but a short calculation shows that
\langle \hat{s}^2 \rangle = \frac{N-1}{N}\,\sigma^2.  (11.52)
Instead
s^2 = \frac{1}{N-1}\sum_{j=1}^{N} (x_j - \bar{x})^2  (11.53)
is an unbiased estimator of \sigma^2.
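A simulation makes the bias of (11.51) visible (a sketch; the choices N = 5, \mu = 0, \sigma^2 = 1 are our own):

% Compare the 1/N and 1/(N-1) variance estimators over many repeated experiments.
N = 5; trials = 1e5;
x = randn(N, trials);                  % each column is one experiment of N samples
fprintf('mean of 1/N estimator     : %.4f   (expect (N-1)/N = %.4f)\n', mean(var(x, 1)), (N-1)/N);
fprintf('mean of 1/(N-1) estimator : %.4f   (expect 1)\n', mean(var(x, 0)));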
On the basis of (11.47) we can say
\frac{X_1 + X_2 + \cdots + X_N}{N} - \mu = O\!\left(\frac{1}{\sqrt{N}}\right),  (11.54)
so that for any \epsilon > 0, by taking N large enough,
\Pr\left(\left|\frac{X_1 + X_2 + \cdots + X_N}{N} - \mu\right| > \epsilon\right)  (11.55)
is as small as we please. This, the law of large numbers, is one of the two cornerstones of probability. The other is the remarkable:
Central Limit Theorem:
Suppose P(x) is an arbitrary pdf with mean \mu and variance \sigma^2, about which nothing more is known except that higher moments exist. Further imagine X_1, X_2, \ldots, X_N are random variables drawn from P(x). Then
\Pr\left(\sum_{j=1}^{N} X_j = \sum_{j=1}^{N} x_j\right) \simeq \frac{e^{-\left(\sum_{j=1}^{N} x_j - N\mu\right)^2 / 2N\sigma^2}}{\sqrt{2\pi N \sigma^2}},  (11.56)
or equivalently, in terms of the sample mean \bar{x},
P(\bar{x}) = \frac{e^{-(\bar{x} - \mu)^2 / 2(\sigma/\sqrt{N})^2}}{\sqrt{2\pi (\sigma/\sqrt{N})^2}}.  (11.57)
Proof: Consider the characteristic function of P(x_1, \ldots, x_N) = P(x_1)\cdots P(x_N),
\Phi(t) = \int e^{it\sum_{j=1}^{N} x_j}\, P(x_1)\cdots P(x_N)\, dx_1 \cdots dx_N = (\phi(t))^N.  (11.58)
Therefore, by inversion,
P(x_1, x_2, \ldots, x_N) = \frac{1}{2\pi}\int e^{-it\sum_{j=1}^{N} x_j}\, (\phi(t))^N\, dt.  (11.59)
Now for small t the characteristic function of P(x) itself has the expansion
\phi(t) = 1 + it\langle x \rangle - \frac{\langle x^2 \rangle}{2}\, t^2 + \cdots.  (11.60)
Therefore
\ln\phi = \ln\left(1 + it\mu - \frac{\langle x^2 \rangle}{2}\, t^2 + \cdots\right) \approx it\mu - \frac{\langle x^2 \rangle}{2}\, t^2 + \frac{\mu^2}{2}\, t^2 + \cdots;  (11.61)
but
\langle x^2 \rangle = \sigma^2 + \mu^2,  (11.62)
and therefore
\ln\phi \approx it\mu - \sigma^2 t^2/2,  (11.63)
so that
N\ln\phi \approx iN\mu t - N\sigma^2 t^2/2 + \cdots.  (11.64)
For N large this shows that the contribution to the integral comes almost entirely from the neighborhood of the origin, and
P(x_1, x_2, \ldots, x_N) \approx \frac{1}{2\pi}\int_{-\infty}^{\infty} e^{-it\left(\sum_{j=1}^{N} x_j - N\mu\right) - N\sigma^2 t^2/2}\, dt = \frac{e^{-\left(\sum_{j=1}^{N} x_j - N\mu\right)^2 / 2N\sigma^2}}{\sqrt{2\pi N \sigma^2}}.  (11.65)
The Central Limit Theorem states that, under mild conditions, the mean of N samples taken from any pdf is distributed as a Gaussian as the number of samples is increased.
Figure 11.1 below shows the result of random selection from the uniform distribution on (0, 1).
Fig. 11.1: The sample means (1/N) \sum_{j=1}^{N} x_j become Gaussian; the x_j are drawn from the uniform pdf.
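A figure of this kind can be regenerated along the following lines (a sketch; the sample sizes, number of trials, and binning are our own choices):

% CLT demo: histogram the sample mean of N uniform(0,1) variates for several N.
trials = 1e4;
Ns = [1 2 10 50];
for i = 1:length(Ns)
    N = Ns(i);
    xbar = mean(rand(N, trials), 1);
    subplot(2, 2, i);
    histogram(xbar, 40, 'Normalization', 'pdf');
    title(sprintf('N = %d', N));
end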
As an exercise in this spirit, consider
C = \int_{-1}^{x} P(x')\,dx' = \frac{3}{4}\left(x - \frac{x^3}{3}\right) + \frac{1}{2},  (11.66)
which, since 0 \le C \le 1, is itself uniformly distributed; therefore set C = rand and solve for x.
The Central Limit Theorem is due to de Moivre. It has applications well beyond the statement that the arithmetic mean is distributed normally: if you expand your view, any attribute due to many (summed) random variables is a candidate for a Gaussian description. A more concrete example appears if we return to the Rogues' Gallery Problem of Lecture 5. There we might assume that the gray level of any pixel is a random variable. Then, since the coefficients in the expansion of faces in terms of eigenfaces are each integrals (sums) over pixels, there is an expectation of normality in their description.