Notes 05
Contents
1 Joint Probability Distributions
  1.1 Two Discrete Random Variables
      1.1.1 Independence of Random Variables
  1.2 Two Continuous Random Variables
  1.3 Collection of Formulas
  1.4 Example of Double Integration
1.1    Two Discrete Random Variables
Call the rvs X and Y . The generalization of the pmf is the joint probability mass function,
which is the probability that X takes some value x and Y takes some value y:
$$ p(x, y) = P([X = x] \cap [Y = y]) \tag{1.1} $$
Since X and Y have to take on some values, all of the entries in the joint probability table
have to sum to 1:
$$ \sum_x \sum_y p(x, y) = 1 \tag{1.2} $$
   The events need not correspond to rectangular regions in the table. For instance, the
event X < Y corresponds to (X, Y ) combinations of (0, 1), (0, 2), and (1, 2), so
$$ P(X < Y) = p(0, 1) + p(0, 2) + p(1, 2) = .04 + .02 + .06 = .12 $$
   Another event you can consider is X = x for some x, regardless of the value of Y . For
example,
                           P (X = 1) = .08 + .20 + .06 = .34                         (1.6)
But of course P (X = x) is just the pmf for X alone; when we obtain it from a joint pmf, we
call it a marginal pmf:
$$ p_X(x) = P(X = x) = \sum_y p(x, y) \tag{1.7} $$
and likewise
$$ p_Y(y) = P(Y = y) = \sum_x p(x, y) \tag{1.8} $$
   For the example above, we can sum the columns to get the marginal pmf pY (y):
    y         0    1    2
    pY (y)   .24  .38  .38
   or sum the rows to get the marginal pmf pX (x):
    x pX (x)
    0     .16
    1     .34
    2     .50
   They’re apparently called marginal pmfs because you can write the sums of columns and
rows in the margins:
                           y
    p(x, y)            0   1   2 pX (x)
              0       .10 .04 .02    .16
    x         1       .08 .20 .06    .34
              2       .06 .14 .30    .50
              pY (y) .24 .38 .38
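The bookkeeping in this table is easy to check mechanically. The following is a minimal Python sketch, with helper names (`marginal_x`, `marginal_y`, `prob_event`) that are illustrative rather than taken from the notes, which sums rows and columns to get the marginal pmfs and adds up the cells belonging to the event X < Y.

```python
# Joint pmf from the table above: rows are x = 0, 1, 2; columns are y = 0, 1, 2.
joint = [
    [0.10, 0.04, 0.02],
    [0.08, 0.20, 0.06],
    [0.06, 0.14, 0.30],
]
values = [0, 1, 2]  # possible values of both X and Y in this example


def marginal_x(p):
    """Sum each row over y to get pX(x)."""
    return [sum(row) for row in p]


def marginal_y(p):
    """Sum each column over x to get pY(y)."""
    return [sum(row[j] for row in p) for j in range(len(p[0]))]


def prob_event(p, event):
    """Add up p(x, y) over every cell where event(x, y) is true."""
    return sum(
        p[i][j]
        for i, x in enumerate(values)
        for j, y in enumerate(values)
        if event(x, y)
    )


p_x = marginal_x(joint)
p_y = marginal_y(joint)
total = sum(p_x)                                     # normalization check
p_x_less_y = prob_event(joint, lambda x, y: x < y)   # non-rectangular event

print("pX:", [round(v, 2) for v in p_x])
print("pY:", [round(v, 2) for v in p_y])
print("total:", round(total, 2))
print("P(X < Y):", round(p_x_less_y, 2))
```

The event is passed in as a function of (x, y), so the same helper handles rectangular and non-rectangular regions of the table alike.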
We can construct from this the probability that (X, Y ) will lie within a certain region
$$ P((X, Y) \in A) \approx \mathop{\sum\sum}_{(x,y) \in A} f(x, y)\, \Delta x\, \Delta y \tag{1.10} $$
What is that, exactly? First, consider the special case of a rectangular region where a < x < b
and c < y < d. Divide it into M pieces in the x direction and N pieces in the y direction:
$$ P((a < X < b) \cap (c < Y < d)) \approx \sum_{i=1}^{M} \sum_{j=1}^{N} f(x_i, y_j)\, \Delta x\, \Delta y \tag{1.11} $$
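Now,
$$ \sum_{j=1}^{N} f(x_i, y_j)\, \Delta y \approx \int_c^d f(x_i, y)\, dy \tag{1.12} $$
so
$$ P((a < X < b) \cap (c < Y < d)) \approx \sum_{i=1}^{M} \left( \int_c^d f(x_i, y)\, dy \right) \Delta x \approx \int_a^b \left( \int_c^d f(x, y)\, dy \right) dx \tag{1.13} $$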
This is a double integral. We integrate with respect to y and then integrate with respect
to x. Note that there was nothing special about the order in which we approximated the sums as
integrals. We could also have taken the limit as the x spacing went to zero first, and then
found
$$ P((a < X < b) \cap (c < Y < d)) \approx \sum_{j=1}^{N} \sum_{i=1}^{M} f(x_i, y_j)\, \Delta x\, \Delta y \approx \sum_{j=1}^{N} \left( \int_a^b f(x, y_j)\, dx \right) \Delta y \approx \int_c^d \left( \int_a^b f(x, y)\, dx \right) dy \tag{1.14} $$
As long as we integrate over the correct region of the x, y plane, it doesn’t matter in what
order we do it.
    All of these approximations get better and better as ∆x and ∆y go to zero, so the
expression involving the integrals is actually exact:
$$ P((a < X < b) \cap (c < Y < d)) = \int_a^b \left( \int_c^d f(x, y)\, dy \right) dx = \int_c^d \left( \int_a^b f(x, y)\, dx \right) dy \tag{1.15} $$
The normalization condition is that the probability density for all values of X and Y has to
integrate up to 1:
$$ \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x, y)\, dx\, dy = 1 \tag{1.16} $$
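To see the double sum in (1.11) approach the double integral in (1.15), here is a small Python sketch. The pdf f(x, y) = (3/2)(x² + y²) on the unit square is a hypothetical choice made only for illustration (it is not an example from these notes); it integrates to 1, and its probability over 0 < x < 1/2, 0 < y < 1/2 works out to exactly 1/16.

```python
def f(x, y):
    """Hypothetical joint pdf (an assumption for illustration):
    f(x, y) = 1.5 * (x^2 + y^2) on the unit square, 0 elsewhere."""
    if 0.0 <= x <= 1.0 and 0.0 <= y <= 1.0:
        return 1.5 * (x * x + y * y)
    return 0.0


def riemann_prob(pdf, a, b, c, d, m, n):
    """Approximate P(a < X < b, c < Y < d) by the double sum of
    pdf(x_i, y_j) * dx * dy, evaluating at the midpoint of each cell."""
    dx = (b - a) / m
    dy = (d - c) / n
    total = 0.0
    for i in range(m):
        x_i = a + (i + 0.5) * dx
        for j in range(n):
            y_j = c + (j + 0.5) * dy
            total += pdf(x_i, y_j) * dx * dy
    return total


exact = 1.0 / 16.0  # value of the double integral for this particular pdf
for pieces in (2, 10, 100):
    approx = riemann_prob(f, 0.0, 0.5, 0.0, 0.5, pieces, pieces)
    print(pieces, approx, abs(approx - exact))

# Normalization: the pdf should integrate to 1 over the whole unit square.
print(riemann_prob(f, 0.0, 1.0, 0.0, 1.0, 200, 200))
```

The printed error shrinks as the number of pieces grows, which is the sense in which the sum becomes the integral as ∆x and ∆y go to zero.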
Practice Problems
5.1, 5.3, 5.9, 5.13, 5.19, 5.21
Thursday 4 February 2010
1.3    Collection of Formulas
                    Discrete                                Continuous
Definition          p(x, y) = P([X = x] ∩ [Y = y])          f(x, y) ≈ P([X = x ± ∆x/2] ∩ [Y = y ± ∆y/2]) / (∆x ∆y)
Normalization       Σ_x Σ_y p(x, y) = 1                     ∫_{−∞}^{∞} ∫_{−∞}^{∞} f(x, y) dx dy = 1
Region A            P((X, Y) ∈ A) = ΣΣ_{(x,y)∈A} p(x, y)    P((X, Y) ∈ A) = ∫∫_{(x,y)∈A} f(x, y) dx dy
Marginal            pX(x) = Σ_y p(x, y)                     fX(x) = ∫_{−∞}^{∞} f(x, y) dy
X & Y indep iff     p(x, y) = pX(x) pY(y) for all x, y      f(x, y) = fX(x) fY(y) for all x, y
1.4    Example of Double Integration
If we do the y integral first, that means that, for each value of x, we need to work out the range
of y values. This means slicing up the triangle along lines of constant x and integrating in
the y direction for each of those:
The conditions for a given x are y ≥ 0, y > x, and y ≤ 1. So the minimum possible y is x
and the maximum is 1, and the integral is
$$ P(X < Y) = \int_0^1 \left( \int_x^1 f(x, y)\, dy \right) dx \tag{1.21} $$
We integrate x over the full range of x, from 0 to 1, because there is a strip present for each
of those xs. Note that the limits of the y integral depend on x, which is fine because the y
integral is inside the x integral.
    If, on the other hand, we decided to do the x integral first, we’d be slicing up into slices
of constant y and considering the range of x values for each possible y:
Now, given a value of y, the restrictions on x are x ≥ 0, x < y, and x ≤ 1, so the minimum
x is 0 and the maximum is y, which makes the integral
$$ P(X < Y) = \int_0^1 \left( \int_0^y f(x, y)\, dx \right) dy \tag{1.22} $$
The integral over y is over the whole range from 0 to 1 because there is a constant-y strip
for each of those values.
    Note that the limits on the integrals are different depending on which order the integrals
are done. The fact that we get the same answer (which should be apparent from the fact
that we’re covering all the points of the same region) is apparently called Fubini’s theorem.
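The two orders of integration in (1.21) and (1.22) can be compared numerically. The sketch below is only an illustration under an assumed pdf, f(x, y) = (3/2)(x² + y²) on the unit square, which is not the pdf of the example in these notes; because it is symmetric in x and y, P(X < Y) should come out to 1/2 either way. The helper `midpoint` is a plain one-dimensional midpoint rule.

```python
def f(x, y):
    """Hypothetical joint pdf on the unit square (an assumption, not from the notes)."""
    return 1.5 * (x * x + y * y)


def midpoint(g, lo, hi, n=400):
    """One-dimensional midpoint rule for the integral of g from lo to hi."""
    h = (hi - lo) / n
    return h * sum(g(lo + (k + 0.5) * h) for k in range(n))


# y integral first: for each x, y runs from x to 1; then x runs from 0 to 1.
y_first = midpoint(lambda x: midpoint(lambda y: f(x, y), x, 1.0), 0.0, 1.0)

# x integral first: for each y, x runs from 0 to y; then y runs from 0 to 1.
x_first = midpoint(lambda y: midpoint(lambda x: f(x, y), 0.0, y), 0.0, 1.0)

print("y first:", y_first)
print("x first:", x_first)
```

Only the inner limits change between the two calls; the region covered, the triangle above the line y = x, is the same, which is why the results agree.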
In the case of continuous rvs, we replace the pmf with the pdf and the sums with integrals:
$$ E(h(X, Y)) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} h(x, y)\, f(x, y)\, dx\, dy \tag{2.2} $$
(All of the properties we’ll show in this section work for both discrete and continuous
distributions, but we’ll do the explicit demonstrations in one case; the other version can be
produced via a straightforward conversion.)
    In particular, we can define the mean value of each of the random variables as before
                                                µX = E(X)                                           (2.3a)
                                                µY = E(Y )                                          (2.3b)
Note that each of these is a function of only one variable, and we can calculate the expected
value of such a function without having to use the full joint probability distribution, since
e.g.,
$$ E(h(X)) = \sum_x \sum_y h(x)\, p(x, y) = \sum_x h(x) \sum_y p(x, y) = \sum_x h(x)\, p_X(x) . \tag{2.5} $$
Which makes sense, since the marginal pmf pX (x) = P (X = x) is just the probability
distribution we’d assign to X if we didn’t know or care about Y .
    A useful expected value which is constructed from a function of both variables is the
covariance. Instead of squaring (X − µX ) or (Y − µY ), we multiply them by each other:
It’s worth going explicitly through the derivation of the shortcut formula:
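As a sketch of the standard definition and shortcut derivation, the steps below use only the linearity of the expected value together with µX = E(X) and µY = E(Y) from (2.3):

```latex
\begin{align*}
\mathrm{Cov}(X, Y) &= E\big([X - \mu_X][Y - \mu_Y]\big) \\
  &= E(XY - \mu_X Y - \mu_Y X + \mu_X \mu_Y) \\
  &= E(XY) - \mu_X E(Y) - \mu_Y E(X) + \mu_X \mu_Y \\
  &= E(XY) - \mu_X \mu_Y
\end{align*}
```

The last step uses E(Y) = µY and E(X) = µX, so two of the three µX µY terms cancel.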
   We can show that the covariance of two independent random variables is zero:
$$ E(XY) = \sum_x \sum_y x\, y\, p(x, y) = \sum_x \sum_y x\, y\, p_X(x)\, p_Y(y) = \sum_x x\, p_X(x) \left( \sum_y y\, p_Y(y) \right) $$
$$ = \sum_x x\, p_X(x)\, \mu_Y = \mu_Y \sum_x x\, p_X(x) = \mu_X \mu_Y \quad \text{if X and Y are independent} \tag{2.8} $$
The converse is, however, not true. We can construct a probability distribution in which X
and Y are not independent but their covariance is zero:
                          y
    p(x, y)         −1    0    1    pX (x)
              −1     0   .2    0      .2
    x          0    .2   .2   .2      .6
               1     0   .2    0      .2
              pY (y) .2  .6   .2
From the form of the marginal pmfs, we can see µX = 0 = µY , and if we calculate
$$ E(XY) = \sum_x \sum_y x\, y\, p(x, y) \tag{2.9} $$
we see that for each x, y combination for which p(x, y) ≠ 0, either x or y is zero, and so
Cov(X, Y ) = E(XY ) = 0.
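Both claims about this table, zero covariance and lack of independence, can be checked directly. The following Python sketch uses illustrative helper names (`expect`, `marginals`) that are not from the notes; it computes Cov(X, Y) from the definition and tests whether p(x, y) = pX(x) pY(y) holds in every cell.

```python
# Joint pmf from the table above: keys are (x, y), values are p(x, y).
joint = {
    (-1, -1): 0.0, (-1, 0): 0.2, (-1, 1): 0.0,
    (0, -1): 0.2,  (0, 0): 0.2,  (0, 1): 0.2,
    (1, -1): 0.0,  (1, 0): 0.2,  (1, 1): 0.0,
}


def expect(p, h):
    """E(h(X, Y)) as the double sum of h(x, y) * p(x, y)."""
    return sum(h(x, y) * prob for (x, y), prob in p.items())


def marginals(p):
    """Return the marginal pmfs pX and pY as dictionaries."""
    p_x, p_y = {}, {}
    for (x, y), prob in p.items():
        p_x[x] = p_x.get(x, 0.0) + prob
        p_y[y] = p_y.get(y, 0.0) + prob
    return p_x, p_y


mu_x = expect(joint, lambda x, y: x)
mu_y = expect(joint, lambda x, y: y)
cov = expect(joint, lambda x, y: (x - mu_x) * (y - mu_y))

p_x, p_y = marginals(joint)
# Independence requires p(x, y) = pX(x) * pY(y) in every single cell.
independent = all(
    abs(prob - p_x[x] * p_y[y]) < 1e-12 for (x, y), prob in joint.items()
)

print("Cov(X, Y) =", cov)
print("independent?", independent)
```

The independence test fails already at the corner cell (−1, −1), where p(x, y) = 0 but pX(−1) pY(−1) = .04.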
   Unlike variance, covariance can be positive or negative. For example, consider two rvs
with the joint pmf
                          y
    p(x, y)         −1    0    1    pX (x)
              −1    .2    0    0      .2
    x          0     0   .6    0      .6
               1     0    0   .2      .2
              pY (y) .2  .6   .2
Since we have the same marginal pmfs as before, µX µY = 0, and
$$ \mathrm{Cov}(X, Y) = E(XY) = \sum_x \sum_y x\, y\, p(x, y) = (-1)(-1)(.2) + (1)(1)(.2) = .4 \tag{2.10} $$
The positive covariance means X tends to be positive when Y is positive and negative when
Y is negative.
   On the other hand if the pmf is
                          y
    p(x, y)         −1    0    1    pX (x)
              −1     0    0   .2      .2
    x          0     0   .6    0      .6
               1    .2    0    0      .2
              pY (y) .2  .6   .2
which again has µX µY = 0, the covariance is
$$ \mathrm{Cov}(X, Y) = E(XY) = \sum_x \sum_y x\, y\, p(x, y) = (-1)(1)(.2) + (1)(-1)(.2) = -.4 \tag{2.11} $$
   One drawback of covariance is that, like variance, it depends on the units of the quantities
considered. If the numbers above were in feet, the covariance would really be .4 ft²; if you
converted X and Y into inches, so that 1 ft became 12 in, the covariance would become
57.6 (actually 57.6 in²). But there wouldn’t really be any increase in the degree to which
X and Y are correlated; the change of units would just have spread things out. The same
thing would happen to the variances of each variable; the covariance is measuring the spread
of each variable along with the correlation between them. To isolate a measure of how
correlated the variables are, we can divide by the product of their standard deviations and
define something called the correlation coëfficient:
$$ \rho_{X,Y} = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y} \tag{2.12} $$
To see why the product of the standard deviations is the right thing, suppose that X and Y
have different units, e.g., X is in inches and Y is in seconds. Then µX is in inches, σX² is in
inches squared, and σX is in inches. Similar arguments show that µY and σY are in seconds,
and thus Cov(X, Y ) is in inches times seconds, so dividing by σX σY makes all of the units
cancel out.
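The effect of a change of units can be made concrete with a short Python sketch. It takes the positively correlated pmf from (2.10), treats the values as feet, converts both variables to inches, and compares covariance and correlation coefficient before and after; the helper names (`rescale`, `cov_and_corr`) are illustrative only.

```python
from math import sqrt

# Nonzero cells of the joint pmf used in (2.10), with X and Y taken to be in feet.
joint_ft = {(-1, -1): 0.2, (0, 0): 0.6, (1, 1): 0.2}


def rescale(p, factor):
    """Express both variables in new units, e.g. factor=12 for feet to inches."""
    return {(x * factor, y * factor): prob for (x, y), prob in p.items()}


def cov_and_corr(p):
    """Return (Cov(X, Y), rho_XY) computed from a joint pmf dictionary."""
    mu_x = sum(x * prob for (x, _), prob in p.items())
    mu_y = sum(y * prob for (_, y), prob in p.items())
    var_x = sum((x - mu_x) ** 2 * prob for (x, _), prob in p.items())
    var_y = sum((y - mu_y) ** 2 * prob for (_, y), prob in p.items())
    cov = sum((x - mu_x) * (y - mu_y) * prob for (x, y), prob in p.items())
    return cov, cov / (sqrt(var_x) * sqrt(var_y))


cov_ft, rho_ft = cov_and_corr(joint_ft)
cov_in, rho_in = cov_and_corr(rescale(joint_ft, 12))

print("feet:  ", cov_ft, rho_ft)
print("inches:", cov_in, rho_in)
```

The covariance picks up a factor of 12² = 144, while the correlation coefficient is unchanged because σX and σY each pick up a factor of 12.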
   Exercise: work out σX , σY , and ρX,Y for each of these examples.
Practice Problems
5.25, 5.27, 5.31, 5.33, 5.37, 5.39
$$ X_1, X_2, \ldots, X_n \text{ independent means } f(x_1, x_2, \ldots, x_n) = f_{X_1}(x_1)\, f_{X_2}(x_2) \cdots f_{X_n}(x_n) \tag{3.2} $$
An even more special case is when each of the rvs follows the same distribution, which we
can then write as e.g., f (x1 ) rather than fX1 (x1 ). Then we say the n rvs are independent
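and identically distributed or iid. Again, writing this explicitly for the continuous case,
$$ X_1, X_2, \ldots, X_n \text{ iid means } f(x_1, x_2, \ldots, x_n) = f(x_1)\, f(x_2) \cdots f(x_n) $$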
This is a useful enough concept that it has a name, and we call it a random sample of size
n drawn from the distribution with pdf f (x). For example, if we roll a die fifteen times, we
can make statements about those fifteen numbers taken together.
On the other hand, if we have a single random variable X, its mean and variance are
calculated with the expected value:
$$ \mu_X = E(X) \tag{3.5a} $$
$$ \sigma_X^2 = E([X - \mu_X]^2) \tag{3.5b} $$
So, given a set of n random variables, we could combine them in the same ways we combine
sample points, and define the mean and variance of those n numbers
$$ \bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i \tag{3.6a} $$
$$ S^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (X_i - \bar{X})^2 \tag{3.6b} $$
Each of these is a statistic, and each is a random variable, which we stress by writing it with a
capital letter. We could then ask about quantities derived from the probability distributions
of the statistics, for example
$$ \mu_{\bar{X}} = E(\bar{X}) \tag{3.7a} $$
$$ (\sigma_{\bar{X}})^2 = V(\bar{X}) = E([\bar{X} - \mu_{\bar{X}}]^2) \tag{3.7b} $$
$$ \mu_{S^2} = E(S^2) \tag{3.7c} $$
$$ (\sigma_{S^2})^2 = V(S^2) = E([S^2 - \mu_{S^2}]^2) \tag{3.7d} $$
   Of special interest is the case where the Xi are an iid random sample. We can actually
work out the means and variances just by using the fact that the expected value is a linear
operator. So
$$ \mu_{\bar{X}} = E(\bar{X}) = E\left( \frac{1}{n} \sum_{i=1}^{n} X_i \right) = \frac{1}{n} \sum_{i=1}^{n} E(X_i) = \frac{1}{n} \sum_{i=1}^{n} \mu_X = \mu_X \tag{3.8} $$
which is kind of what we’d expect: the average of n iid random variables has an expected
value which is equal to the mean of the underlying distribution. But we can also consider
what the variance of the mean is. I.e., on average, how far away will the mean calculated
from a random sample be from the mean of the underlying distribution?
$$ (\sigma_{\bar{X}})^2 = V\left( \frac{1}{n} \sum_{i=1}^{n} X_i \right) = E\left( \left[ \frac{1}{n} \sum_{i=1}^{n} X_i - \mu_X \right]^2 \right) \tag{3.9} $$
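It’s easy to get a little lost squaring the sum, but we can see the essential point of what
happens in the case where n = 2:
$$ (\sigma_{\bar{X}})^2 = E\left( \left[ \tfrac{1}{2}(X_1 + X_2) - \mu_X \right]^2 \right) = E\left( \left[ \tfrac{1}{2}(X_1 - \mu_X) + \tfrac{1}{2}(X_2 - \mu_X) \right]^2 \right) $$
$$ = E\left( \tfrac{1}{4}(X_1 - \mu_X)^2 + \tfrac{1}{2}(X_1 - \mu_X)(X_2 - \mu_X) + \tfrac{1}{4}(X_2 - \mu_X)^2 \right) \tag{3.10} $$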
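Since the expected value operation is linear, this is
$$ (\sigma_{\bar{X}})^2 = \tfrac{1}{4} E([X_1 - \mu_X]^2) + \tfrac{1}{2} E([X_1 - \mu_X][X_2 - \mu_X]) + \tfrac{1}{4} E([X_2 - \mu_X]^2) $$
$$ = \tfrac{1}{4} V(X_1) + \tfrac{1}{2} \mathrm{Cov}(X_1, X_2) + \tfrac{1}{4} V(X_2) \tag{3.11} $$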
But since X1 and X2 are independent random variables, their covariance is zero, and
$$ (\sigma_{\bar{X}})^2 = \tfrac{1}{4} \sigma_X^2 + \tfrac{1}{4} \sigma_X^2 = \tfrac{1}{2} \sigma_X^2 \tag{3.12} $$
The same thing works for n iid random variables: the cross terms are all covariances which
equal zero, and we get n copies of (1/n²) σX², so that
$$ (\sigma_{\bar{X}})^2 = \frac{1}{n} \sigma_X^2 \tag{3.13} $$
This means that if you have a random sample of size n, the sample mean will be a better
estimate of the underlying population mean the larger n is. (There are assorted anecdotal
examples of this in Devore.)
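Result (3.13) can be illustrated by simulation. The sketch below is an assumption-laden illustration rather than anything from the notes: it uses rolls of a fair six-sided die as the underlying distribution (variance 35/12), draws many random samples of size n, and compares the observed variance of the sample means with σX²/n. The function names and the number of trials are arbitrary choices.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable


def sample_mean_of_dice(n):
    """Roll a fair six-sided die n times and return the sample mean."""
    return sum(random.randint(1, 6) for _ in range(n)) / n


def variance_of_sample_mean(n, trials=20000):
    """Estimate V(X-bar) by drawing many independent random samples of size n."""
    means = [sample_mean_of_dice(n) for _ in range(trials)]
    return statistics.pvariance(means)


sigma2 = 35 / 12  # variance of a single fair die roll
for n in (1, 4, 15):
    # columns: sample size, simulated variance of the mean, sigma^2 / n
    print(n, variance_of_sample_mean(n), sigma2 / n)
```

The simulated values only approximate σX²/n, since they are themselves estimates from a finite number of trials, but the 1/n scaling is clearly visible.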
    Note that these statements can also be made about the sum
$$ T_o = X_1 + X_2 + \cdots + X_n = n \bar{X} \tag{3.14} $$
rather than the mean, and they are perhaps easier to remember:
$$ E(T_o) = n \mu_X $$
and
$$ V(T_o) = n \sigma_X^2 $$
The first result follows from linearity of the expected value; the second can be illustrated for
n = 2 as before:
$$ V(a_1 X_1 + a_2 X_2) = E\left( [(a_1 X_1 + a_2 X_2) - \mu_Y]^2 \right) = E\left( [(a_1 X_1 + a_2 X_2) - (a_1 \mu_1 + a_2 \mu_2)]^2 \right) $$
4    The Central Limit Theorem
We have shown that if you add a bunch of independent random variables, the resulting
statistic has a mean which is the sum of the individual means and a variance which is the
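sum of the individual variances. There is a remarkable theorem which means that if you add
up enough iid random variables, the mean and variance are all you need to know. This is
known as the Central Limit Theorem:
    If X1 , X2 , . . . Xn are iid random variables, each with mean µ and variance σX², then for
    sufficiently large n the sum To = X1 + X2 + · · · + Xn is approximately normally
    distributed, with mean nµ and variance nσX².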
The same can also be said for the sample mean X̄ = To /n, but now the mean is µ and the
variance is σX²/n. As a rule of thumb, the central limit theorem applies for n ≳ 30.
    We have actually already used the central limit theorem when approximating a binomial
distribution with the corresponding normal distribution. A binomial rv can be thought of
as the sum of n Bernoulli random variables.
    You may also have seen the central limit theorem in action if you’ve considered the pmf
for the results of rolling several six-sided dice and adding the results. With one die, the
results are uniformly distributed:
If we roll two dice and add the results, we get a non-uniform distribution, with results close
to 7 being most likely, and the probability distribution declining linearly from there:
If we add three dice, the distribution begins to take on the shape of a “bell curve” and in
fact such a random distribution is used to approximate the distribution of human physical
properties in some role-playing games:
Adding more and more dice produces histograms that look more and more like a normal
distribution:
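The dice distributions described above can be generated exactly rather than plotted. The following Python sketch builds the pmf of the total of several fair dice by repeated convolution and prints a crude text histogram; the function name `pmf_of_dice_sum` and the histogram scaling are illustrative choices, not part of the notes.

```python
from fractions import Fraction


def pmf_of_dice_sum(num_dice, sides=6):
    """Exact pmf of the total of num_dice fair dice, built by repeated convolution."""
    one_die = {face: Fraction(1, sides) for face in range(1, sides + 1)}
    total = {0: Fraction(1)}  # pmf of an empty sum: zero with probability 1
    for _ in range(num_dice):
        new_total = {}
        for s, p_s in total.items():
            for face, p_face in one_die.items():
                # independence: P(sum = s and next die = face) = p_s * p_face
                new_total[s + face] = new_total.get(s + face, Fraction(0)) + p_s * p_face
        total = new_total
    return total


for k in (1, 2, 3):
    pmf = pmf_of_dice_sum(k)
    print(f"{k} dice")
    for s in sorted(pmf):
        # one '#' per 1/216 of probability
        print(f"  {s:2d} {'#' * int(pmf[s] * 216)}")
```

Using `Fraction` keeps the probabilities exact, so the flat shape for one die, the triangular shape for two, and the emerging bell shape for three are not blurred by rounding.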
Practice Problems
5.49, 5.55, 5.57, 5.61, 5.65, 5.67