
Lecture 03


STATS 300C: Theory of Statistics Spring 2023

Lecture 3 — April 10
Lecturer: Prof. Emmanuel Candès Scribe: Leda Liang

Warning: These notes may contain factual and/or typographic errors. They are
based on Emmanuel Candès’s course from 2018 and 2023, and scribe notes written
by Logan Bell, John Cherian, Paula Gablenz, and David Ritzwoller.

3.1 Recap: Bonferroni and Fisher Have Drawbacks


In the previous two lectures, we have introduced and discussed multiple hypothesis testing.
We have been interested in testing the global null, and to this end we considered Bonferroni’s
method and Fisher’s combination test, but each of these has its pros and cons. By examining
their performance in the independent Gaussian model,1 we determined the following:
• If our data exhibit strong, sparsely distributed signals, then Bonferroni’s
method excels and is optimal in the “needle in a haystack” setting in which one µi
is nonzero, but Fisher’s method performs very poorly.

• If our data exhibit small, widely distributed signals, Fisher’s method excels and
is optimal as expected noise power overtakes expected signal power, but Bonferroni’s
method is powerless.
In light of these facts, we would like a test that combines the strengths of Bonferroni and
Fisher. This lecture will examine a few such methods. We will proceed by considering n
hypotheses $H_{0,1}, \dots, H_{0,n}$ and p-values $p_1, \dots, p_n$ such that $p_i \sim \mathrm{Unif}(0, 1)$ under $H_{0,i}$, and
we will be testing the global null $H_0 = \bigcap_{i=1}^{n} H_{0,i}$.

3.2 Simes Test


Simes test, introduced independently by Simes [? ] in 1986 and by Eklund [? ] in the early
1960s, is a modification of Bonferroni’s test that is less conservative. Suppose we choose
a significance level α and we order our p-values as $p_{(1)} \le \dots \le p_{(n)}$. Bonferroni’s method
considers only the smallest p-value, rejecting $H_0$ if $p_{(1)} \le \alpha/n$. Simes test extends this idea,
rejecting if $p_{(i)} \le \alpha i/n$ for any $i$. Equivalently, we compute the Simes statistic,
\[ T_n = \min_{1 \le i \le n} \left\{ p_{(i)} \frac{n}{i} \right\}, \]
and reject if $T_n \le \alpha$. Note that the factor of n/i is an adjustment factor that is 1 for the
largest p-value, but n for the smallest p-value. The sketch in Figure 3.1 illustrates the test.
1 That is, our data are sampled as $Y_i \overset{\text{ind}}{\sim} N(\mu_i, 1)$ random variables with the global hypothesis $H_0 = \bigcap_{i=1}^{n} H_{0,i}$, where $H_{0,i} : \mu_i = 0$, and alternative hypothesis $H_1$ that there exists an $i$ such that $\mu_i = \mu > 0$.


In the left panel, no p-value falls below the critical line, so H0 would not be rejected. In the
right panel, there exists a p-value which falls below the critical line, and therefore Simes test
would reject H0 .

Figure 3.1. Sketch of an example where the Simes test would not reject H0 (left panel) and where it would
reject (right panel). p-values are displayed as grey dots, the critical line $\alpha i/n$ is shown in grey. The p-value
below the critical line in the right panel is shown as a red dot.
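
The Simes statistic defined above can be computed in a few lines. The following is a minimal sketch, assuming the p-values arrive as a plain Python list; the function names `simes_statistic`, `simes_rejects`, and `bonferroni_rejects` are illustrative and do not come from the lecture.

```python
def simes_statistic(pvalues):
    """Return T_n = min_i p_(i) * n / i over the sorted p-values."""
    n = len(pvalues)
    if n == 0:
        raise ValueError("need at least one p-value")
    ordered = sorted(pvalues)
    # i runs from 1 to n, so the smallest p-value is multiplied by n
    # and the largest by 1.
    return min(p * n / i for i, p in enumerate(ordered, start=1))


def simes_rejects(pvalues, alpha=0.05):
    """Reject the global null when the Simes statistic is at most alpha."""
    return simes_statistic(pvalues) <= alpha


def bonferroni_rejects(pvalues, alpha=0.05):
    """Reject the global null when the smallest p-value is at most alpha / n."""
    return min(pvalues) <= alpha / len(pvalues)


# Two moderately small p-values: Simes rejects, Bonferroni does not.
example = [0.03, 0.04]
print(simes_statistic(example))  # min(0.03 * 2 / 1, 0.04 * 2 / 2)
print(simes_rejects(example), bonferroni_rejects(example))
```

With n = 2 and α = 0.05, the pair (0.03, 0.04) lies in the part of the Simes rejection region that Bonferroni misses: neither p-value is below α/2, but both are below α.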

As we observed earlier, Simes test extends Bonferroni’s method to consider every p-value
rather than just the smallest one. As such, its rejection region contains Bonferroni’s rejection
region. We can visualize the case n = 2, shown in Figure 3.2 below, in which Simes’s rejection
region is shaded in gray while Bonferroni’s rejection region is contained within, depicted by
hatching lines.

Figure 3.2. Rejection regions for Simes and Bonferroni procedures in the case n = 2.

In the above example, the Simes test rejects if $p_{(1)} \le \alpha/2$ or $p_{(2)} \le \alpha$, while Bonferroni’s
method rejects if $p_{(1)} \le \alpha/2$. Though the Simes test is less conservative than Bonferroni’s test,
it is not too liberal. The following theorem shows that the size of Simes test is exactly α
when the $p_i$ are independent under $H_0$.
Theorem 1. Under H0 and independence of the pi ’s, Tn ∼ Unif(0, 1).
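Proof. We will show this by induction on n, following Simes [? ]. In the base case n = 1,
under $H_0$, it is clear that $T_1 = p_1 \sim \mathrm{Unif}(0, 1)$. Now assume that $T_{n-1} \sim \mathrm{Unif}(0, 1)$ for some
n. We will show that $P(T_n \le \alpha) = \alpha$ for $\alpha \in [0, 1]$. First, we split the probability into the
cases $p_{(n)} \le \alpha$ and $p_{(n)} > \alpha$. Since the adjustment factor n/i equals 1 at i = n, the event
$p_{(n)} \le \alpha$ already implies $T_n \le \alpha$, so
\[ P(T_n \le \alpha) = P\left( \min_{1 \le i \le n} p_{(i)} \frac{n}{i} \le \alpha \right) = P(p_{(n)} \le \alpha) + P\left( \min_{1 \le i \le n-1} p_{(i)} \frac{n}{i} \le \alpha,\ p_{(n)} > \alpha \right). \tag{3.1} \]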

Now recall that the density of the n-th order statistic $p_{(n)}$ is given by $f(t) = n t^{n-1}$ for
$t \in [0, 1]$. We can thus compute each term individually. The first term is straightforward:
\[ P(p_{(n)} \le \alpha) = \int_0^{\alpha} n t^{n-1} \, dt = \alpha^n. \]

For the second term, we observe that whenever $t \in (0, 1]$,
\[ \min_{1 \le i \le n-1} p_{(i)} \frac{n}{i} \le \alpha \iff \min_{1 \le i \le n-1} \frac{p_{(i)}}{t} \cdot \frac{n-1}{i} \le \frac{\alpha}{t} \cdot \frac{n-1}{n}. \]
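Moreover, conditional on $p_{(n)} = t$, the other p-values are i.i.d. Unif(0, t) random variables,
so the rescaled values $p_{(i)}/t$, $i \le n-1$, are the order statistics of $n-1$ i.i.d. Unif(0, 1) variables
and the minimum on the right-hand side above is distributed as $T_{n-1}$. Since $t > \alpha$ in the region of
integration, the argument $\frac{\alpha}{t} \cdot \frac{n-1}{n}$ lies in [0, 1]. Thus, using the inductive hypothesis,
\[
\begin{aligned}
P\left( \min_{1 \le i \le n-1} p_{(i)} \frac{n}{i} \le \alpha,\ p_{(n)} > \alpha \right)
&= E\left[ P\left( \min_{1 \le i \le n-1} p_{(i)} \frac{n}{i} \le \alpha \,\Big|\, p_{(n)} \right) I\left( p_{(n)} > \alpha \right) \right] \\
&= \int_{\alpha}^{1} P\left( T_{n-1} \le \frac{\alpha}{t} \cdot \frac{n-1}{n} \right) f(t) \, dt \\
&= \int_{\alpha}^{1} \frac{\alpha}{t} \cdot \frac{n-1}{n} \cdot n t^{n-1} \, dt \\
&= \alpha \int_{\alpha}^{1} (n-1) t^{n-2} \, dt \\
&= \alpha (1 - \alpha^{n-1}) \\
&= \alpha - \alpha^n.
\end{aligned}
\]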

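Thus, (3.1) may be continued as
\[
\begin{aligned}
P(T_n \le \alpha) &= P(p_{(n)} \le \alpha) + P\left( \min_{1 \le i \le n-1} p_{(i)} \frac{n}{i} \le \alpha,\ p_{(n)} > \alpha \right) \\
&= \alpha^n + \alpha - \alpha^n \\
&= \alpha.
\end{aligned}
\]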
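
Theorem 1 can also be checked numerically. The sketch below is a Monte Carlo illustration and no substitute for the proof: it draws independent Unif(0, 1) p-values and estimates $P(T_n \le \alpha)$. The helper names, the seed, and the number of repetitions are arbitrary choices made for this example.

```python
import random


def simes_statistic(pvalues):
    """Return T_n = min_i p_(i) * n / i over the sorted p-values."""
    n = len(pvalues)
    return min(p * n / i for i, p in enumerate(sorted(pvalues), start=1))


def simes_null_rejection_rate(n, alpha, reps, seed=0):
    """Estimate P(T_n <= alpha) when all n p-values are independent uniforms."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        pvalues = [rng.random() for _ in range(n)]
        if simes_statistic(pvalues) <= alpha:
            rejections += 1
    return rejections / reps


rate = simes_null_rejection_rate(n=10, alpha=0.05, reps=20000)
print(rate)  # should be close to 0.05, up to Monte Carlo error
```

With 20000 repetitions the Monte Carlo standard error is roughly 0.0015, so an estimate within about 0.005 of α is what the theorem predicts.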

We will see next week that the Simes test also has size at most α under a sort of positive
dependence [? ]. With respect to power, Simes test behaves similarly to Bonferroni’s method:
it is powerful for data which exhibit a few strong effects, and it has moderate
power for many mild effects.


3.3 Tests Based on Empirical CDF’s


Another approach to aggregating the effects of the p-values is to consider the resulting
empirical CDF and compare it to the theoretical CDF one would observe under H0 .

Definition 1. The empirical CDF of $p_1, \dots, p_n$ is
\[ \hat{F}_n(t) = \frac{1}{n} \#\{ i : p_i \le t \}. \]
Intuitively, the empirical CDF gives the fraction of the p-values that are at most some
threshold value t. In our case, under $H_0$ we have that $I(p_i \le t) \sim \mathrm{Ber}(t)$, so we should
have $n\hat{F}_n(t) \sim \mathrm{Bin}(n, t)$ and thus $E_{H_0}(\hat{F}_n(t)) = t$. To test $H_0$, we would then measure
the difference between this theoretical distribution and what we actually observe. If the
difference between our empirical observation F̂n (t) and the expected observation t is too
large, then we would reject H0 . We consider three tests based on the empirical CDF: the
Kolmogorov-Smirnov test, the Anderson-Darling test, and Tukey’s second-level significance
test.

3.3.1 Kolmogorov-Smirnov Test


The idea behind the Kolmogorov-Smirnov test is to scan over the empirical CDF and re-
ject if the distance between F̂n (t) and t is ever “too large.” More precisely, we define the
Kolmogorov-Smirnov (KS) test statistic as follows.

Definition 2. The Kolmogorov-Smirnov (KS) test statistic is defined as
\[ \mathrm{KS} = \sup_{t} |\hat{F}_n(t) - t|. \]

We reject $H_0$ if KS exceeds some threshold. Another option is to consider a one-sided
statistic, $\mathrm{KS}^+ = \sup_t (\hat{F}_n(t) - t)$. This statistic is used when we care especially about small
p-values (as we often do), as if many small p-values are present, then F̂n (t) − t will be large
and positive.
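
Because the empirical CDF only jumps at the observed p-values, both suprema can be evaluated at the order statistics. The following minimal sketch assumes the p-values lie in [0, 1]; the function name `ks_statistics` is chosen for this example only.

```python
def ks_statistics(pvalues):
    """Return (KS, KS_plus) against the Unif(0, 1) CDF."""
    n = len(pvalues)
    ordered = sorted(pvalues)
    # At t = p_(i) the empirical CDF equals i / n, so the largest positive
    # gap F_n(t) - t is attained at one of the order statistics.
    above = max(i / n - p for i, p in enumerate(ordered, start=1))
    # Just before t = p_(i) the empirical CDF equals (i - 1) / n, which
    # gives the largest negative gap t - F_n(t).
    below = max(p - (i - 1) / n for i, p in enumerate(ordered, start=1))
    return max(above, below), above


ks, ks_plus = ks_statistics([0.1, 0.4, 0.7])
print(ks, ks_plus)
```

For the three p-values in the example, the largest positive gap occurs at the last order statistic, where the empirical CDF equals 1 and t equals 0.7.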
In real life, finding the correct threshold is tricky since we need to know the theoretical
distribution of KS under the global null hypothesis. One might use simulations or asymptotic
calculations instead. A useful inequality developed by Massart [? ] shows that the tail of
the KS statistic is sub-Gaussian, and thus decays fast.

Theorem 2 (Massart’s Inequality). Under $H_0$ and independence,
\[ P(\mathrm{KS}^+ \ge u) \le 2 e^{-2 n u^2} \quad \text{for } u \ge \sqrt{\frac{\log 2}{2n}}. \]

A lot of work went into establishing this very nice bound, and we will not prove it here.
Interested students are referred to Massart [? ].
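
One way to use the bound, sketched here under the inequality exactly as stated in Theorem 2, is to solve $2e^{-2nu^2} = \alpha$ for u. This gives a conservative rejection threshold for $\mathrm{KS}^+$ without simulation. The function name `massart_threshold` is illustrative.

```python
import math


def massart_threshold(n, alpha):
    """Return u such that 2 * exp(-2 * n * u**2) equals alpha."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    # alpha <= 1 gives log(2 / alpha) >= log(2), so u automatically satisfies
    # the condition u >= sqrt(log(2) / (2 * n)) required by the theorem.
    return math.sqrt(math.log(2.0 / alpha) / (2.0 * n))


u = massart_threshold(n=100, alpha=0.05)
print(u)  # rejecting when KS+ >= u has size at most 0.05 by the bound
```

The threshold shrinks at rate $1/\sqrt{n}$, matching the scale of the fluctuations of $\hat{F}_n(t) - t$ under the null.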


3.3.2 Anderson-Darling Test


Instead of looking at the maximum gap between the empirical CDF and the expected CDF
under the global null hypothesis, as in the case of the KS statistic, another idea is to consider
the cumulative gap. This idea can be expressed by a quadratic test statistic.
Definition 3. A quadratic test statistic is one of the form
\[ A^2 = n \int_0^1 (\hat{F}_n(t) - t)^2 \, \omega(t) \, dt, \]
where $\omega(t) \ge 0$ is a weight function.


Note that such a statistic uses the squared difference rather than the absolute difference,
meaning small deviations are treated as practically negligible compared to larger gaps.2
The weight function ω(t) exists to grant greater significance to gaps at certain thresholds.
Common weight functions are ω(t) = 1, which yields the Cramér-von Mises statistic, and
$\omega(t) = [t(1 - t)]^{-1}$, which gives the Anderson-Darling statistic. The Anderson-Darling
statistic,
\[ A^2 = n \int_0^1 \frac{(\hat{F}_n(t) - t)^2}{t(1 - t)} \, dt, \]
was introduced by Anderson and Darling [? ] in 1954, and it puts more weight on small and
large p-values when compared with the Cramér-von Mises statistic. For statistical intuition,
one can think of the Anderson-Darling statistic as “averaging” the squared z-score over t.
This is because, under the global null, $n\hat{F}_n(t) \sim \mathrm{Bin}(n, t)$, and thus $\mathrm{Var}(\hat{F}_n(t)) \propto t(1 - t)$,
so the integrand $(\hat{F}_n(t) - t)^2 [t(1 - t)]^{-1}$ is similar to a squared z-score.
Notice that, since $\hat{F}_n$ is piecewise constant, we can explicitly compute the Anderson-Darling
statistic $A^2$ as
\[
\begin{aligned}
A^2 &= n \int_0^1 \frac{(\hat{F}_n(t) - t)^2}{t(1 - t)} \, dt \\
&= n \left( \int_0^{p_{(1)}} \frac{t^2}{t(1 - t)} \, dt + \sum_{k=1}^{n-1} \int_{p_{(k)}}^{p_{(k+1)}} \frac{(k/n - t)^2}{t(1 - t)} \, dt + \int_{p_{(n)}}^{1} \frac{(1 - t)^2}{t(1 - t)} \, dt \right) \\
&= n \Bigg( -p_{(1)} - \log(1 - p_{(1)}) \\
&\qquad - \sum_{k=1}^{n-1} \left[ (p_{(k+1)} - p_{(k)}) + \left( 1 - \frac{2k}{n} \right) \log\left( \frac{1 - p_{(k+1)}}{1 - p_{(k)}} \right) + \frac{k^2}{n^2} \log\left( \frac{p_{(k)} (1 - p_{(k+1)})}{p_{(k+1)} (1 - p_{(k)})} \right) \right] \\
&\qquad - (1 - p_{(n)}) - \log(p_{(n)}) \Bigg) \\
&= -n - \sum_{k=1}^{n} \frac{2k - 1}{n} \left[ \log(p_{(k)}) + \log(1 - p_{(n+1-k)}) \right]. \tag{3.2}
\end{aligned}
\]
k=1
n
2 A quadratic statistic is thus akin to a weighted $L^2$-norm of $\hat{F}_n(t) - t$, while the KS statistic is like the $L^\infty$-norm.


This formula lets us connect the Anderson-Darling test to test statistics we have seen previously.
Recall that Fisher’s test statistic is $T_F = -2 \sum_{i=1}^{n} \log(p_i)$, and Pearson’s test statistic
is $T_P = 2 \sum_{i=1}^{n} \log(1 - p_i)$. The formula (3.2) shows that the Anderson-Darling statistic is a
combination of Fisher’s test and Pearson’s test. Compared to Fisher’s test, the Anderson-Darling
test assigns greater weight to the p-values that are in the bulk because it re-weights
the p-values depending on their rank, something that Fisher’s test does not do. This alleviates
the high sensitivity to small p-values that Fisher’s test experiences.
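
The closed form (3.2) is easy to evaluate and to sanity-check against the defining integral. The sketch below assumes the nonnegative weight $1/(t(1 - t))$ and p-values strictly between 0 and 1. The function names and the midpoint-rule grid size are choices made for this illustration.

```python
import math


def anderson_darling(pvalues):
    """A^2 from formula (3.2); requires every p-value strictly in (0, 1)."""
    n = len(pvalues)
    ordered = sorted(pvalues)
    total = 0.0
    for k in range(1, n + 1):
        p_k = ordered[k - 1]       # p_(k)
        p_mirror = ordered[n - k]  # p_(n+1-k)
        total += (2 * k - 1) / n * (math.log(p_k) + math.log(1.0 - p_mirror))
    return -n - total


def anderson_darling_numeric(pvalues, grid=200000):
    """Midpoint-rule approximation of n * integral (F_n(t) - t)^2 / (t (1 - t)) dt."""
    n = len(pvalues)
    ordered = sorted(pvalues)
    total, count = 0.0, 0
    for j in range(grid):
        t = (j + 0.5) / grid
        while count < n and ordered[count] <= t:
            count += 1  # count = #{i : p_i <= t}
        total += (count / n - t) ** 2 / (t * (1.0 - t))
    return n * total / grid


sample = [0.02, 0.3, 0.45, 0.8]
print(anderson_darling(sample), anderson_darling_numeric(sample))
```

The two numbers should agree up to the discretization error of the midpoint rule. For a single p-value equal to 1/2, formula (3.2) reduces to $2 \log 2 - 1$.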

3.3.3 Tukey’s Second-Level Significance Testing: Higher-Criticism


Statistic
As we have seen, the Kolmogorov-Smirnov test looks for the maximum distance between the
empirical CDF and its expected value under the global null hypothesis, while the Anderson-
Darling test integrates the differences instead. We now combine the two approaches.
When testing n hypotheses at level α, under the global null we would expect nα tests to be significant, while
the observed number of significant tests would be $n\hat{F}_n(\alpha)$ and the standard deviation would
be given by $\sqrt{n\alpha(1 - \alpha)}$. Thus, as previously, we can construct a z-score, and the overall
significance at level α would be
\[ \frac{(\#\text{ significant tests at level } \alpha) - \text{expected}}{\text{SD}} = \frac{n\hat{F}_n(\alpha) - n\alpha}{\sqrt{n\alpha(1 - \alpha)}}. \]

According to Donoho and Jin [? ], Tukey [? ] proposed in his class notes at Princeton
to use a second-level significance test. Tukey’s test combines the Kolmogorov-Smirnov test
and the Anderson-Darling test by taking the maximum of the difference F̂n (t) − t, as in the
Kolmogorov-Smirnov test, but also standardizes the difference like in the Anderson-Darling
test. Specifically, Tukey proposed using the higher-criticism statistic.

Definition 4. The higher-criticism statistic is
\[ \mathrm{HC}^*_n = \max_{0 < \alpha \le \alpha_0} \frac{\hat{F}_n(\alpha) - \alpha}{\sqrt{\alpha(1 - \alpha)/n}}. \]
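
In practice the maximum over α can be searched at the order statistics. Between two consecutive order statistics $\hat{F}_n$ is constant, and for a constant c in [0, 1] the ratio $(c - \alpha)/\sqrt{\alpha(1 - \alpha)}$ is decreasing in α, so each piece is largest at its left endpoint. The sketch below is one possible implementation under that observation. It treats the maximum as a supremum, which is at least 0 because the standardized gap tends to 0 as α tends to 0. The function name and the default `alpha0=0.5` are assumptions made for this example.

```python
import math


def higher_criticism(pvalues, alpha0=0.5):
    """Supremum over 0 < alpha <= alpha0 of the standardized ECDF gap."""
    n = len(pvalues)
    ordered = sorted(pvalues)
    best = 0.0  # limiting value of the standardized gap as alpha -> 0
    for i, p in enumerate(ordered, start=1):
        if p > alpha0:
            break
        if p <= 0.0 or p >= 1.0:
            continue  # the standardization is undefined at 0 and 1
        # At alpha = p_(i) the empirical CDF equals i / n.
        z = (i / n - p) / math.sqrt(p * (1.0 - p) / n)
        best = max(best, z)
    return best


print(higher_criticism([0.01, 0.5, 0.9], alpha0=0.5))
```

Implementations that maximize over the indices i ≤ α0·n instead of over p-values below α0 are also common; that variant is a different convention from the one sketched here.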

Theoretical Analysis
Donoho and Jin [? ], based on Jin [? ] and Ingster [? ], provide a theoretical analysis of
Tukey’s higher-criticism statistic. The higher-criticism statistic “scans” across the signifi-
cance levels for departures from H0 . Hence, a large value of HC∗n indicates significance of
an overall body of tests. To understand the power of the higher-criticism statistic and to
compare it to Bonferroni’s method, we will next study sparse mixtures.


3.4 Sparse Mixtures


Donoho and Jin [? ], based on Jin [? ] and Ingster [? ], consider n tests of H0,i vs. H1,i with
independent test statistics Xi . In the original model, the hypotheses are

H0,i : Xi ∼ N (0, 1)
H1,i : Xi ∼ N (µi , 1) µi > 0.

We are interested in possibilities within H1 with a small fraction of non-null hypotheses.


Rather than directly assuming that there is some amount of nonzero means under H1 , we
assume that our samples follow a mixture of N (0, 1) and N (µ, 1), with µ > 0 fixed and with
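some mixture parameter ε. This simple model with equal means can be written as
\[
\begin{aligned}
H_{0,i} &: X_i \overset{\text{i.i.d.}}{\sim} N(0, 1) \\
H_{1,i} &: X_i \overset{\text{i.i.d.}}{\sim} (1 - \varepsilon) N(0, 1) + \varepsilon N(\mu, 1).
\end{aligned}
\]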

The expected number of non-nulls under H1 is nε. If ε = 1/n, then the above would become
the “needle in a haystack” problem: on average, there would be one coordinate with µ
nonzero.
If ε and µ were known, then the optimal test would be the likelihood ratio test. The
likelihood ratio statistic under the sparse mixture model is
\[ L = \prod_{i=1}^{n} \left[ (1 - \varepsilon) + \varepsilon e^{\mu X_i - \mu^2/2} \right]. \]
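
For computation it is convenient to work with log L, which is a sum over observations. The following minimal sketch evaluates it for known ε and µ; the function name is illustrative, and `log1p`/`expm1` are used only for numerical accuracy when ε is small.

```python
import math


def log_likelihood_ratio(xs, eps, mu):
    """log L = sum_i log((1 - eps) + eps * exp(mu * x_i - mu**2 / 2))."""
    total = 0.0
    for x in xs:
        a = mu * x - 0.5 * mu * mu
        # (1 - eps) + eps * e^a = 1 + eps * (e^a - 1); log1p and expm1
        # keep precision when eps or a is close to zero.
        total += math.log1p(eps * math.expm1(a))
    return total


print(log_likelihood_ratio([0.3, -1.2, 2.5], eps=0.1, mu=2.0))
```

Large positive values of log L favor the mixture alternative. A rejection threshold would still have to be calibrated under the null, for example by simulation.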

The asymptotic analysis of Ingster [? ] and Jin [? ] specifies the dependency of ε and µ
on n as follows:
\[
\begin{aligned}
\varepsilon_n &= n^{-\beta}, & \tfrac{1}{2} < \beta < 1, \\
\mu_n &= \sqrt{2 r \log n}, & 0 < r < 1.
\end{aligned}
\]

The parameter β controls the sparsity of the alternatives: the expected number of non-nulls is
$n\varepsilon_n = n^{1-\beta}$, which ranges from 1 (at β = 1) up to $\sqrt{n}$ (as β approaches 1/2). The parameter
r parametrizes the mean shift. If β were large, then our problem
would be very sparse, while if β were small, it would be mildly sparse. If r = 1, then we
get the detection threshold we have seen for Bonferroni. Hence, the “needle in a haystack”
problem corresponds to β = 1 and r = 1.
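
To make the calibration concrete, the sketch below converts (n, β, r) into $(\varepsilon_n, \mu_n)$ and draws one sample from the mixture alternative. The function names and the fixed seed are assumptions made for the illustration.

```python
import math
import random


def calibrate(n, beta, r):
    """Return (eps_n, mu_n) = (n**(-beta), sqrt(2 * r * log(n)))."""
    eps_n = n ** (-beta)
    mu_n = math.sqrt(2.0 * r * math.log(n))
    return eps_n, mu_n


def sample_sparse_mixture(n, beta, r, seed=0):
    """Draw X_1, ..., X_n i.i.d. from (1 - eps_n) N(0, 1) + eps_n N(mu_n, 1)."""
    rng = random.Random(seed)
    eps_n, mu_n = calibrate(n, beta, r)
    xs = []
    for _ in range(n):
        # Each coordinate is non-null independently with probability eps_n.
        shift = mu_n if rng.random() < eps_n else 0.0
        xs.append(rng.gauss(shift, 1.0))
    return xs


eps_n, mu_n = calibrate(10**4, beta=0.75, r=0.5)
print(eps_n, mu_n, 10**4 * eps_n)  # the last value is the expected number of non-nulls
xs = sample_sparse_mixture(10**4, beta=0.75, r=0.5)
```

With n = 10^4 and β = 3/4, only about n^{1/4} = 10 of the 10^4 coordinates carry signal on average, which shows how sparse the alternatives in this regime are.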
Ingster [? ] and Jin [? ] find that there is a threshold curve for r of the form
\[
\rho^*(\beta) =
\begin{cases}
\beta - 1/2, & 1/2 < \beta < 3/4, \\
(1 - \sqrt{1 - \beta})^2, & 3/4 \le \beta \le 1,
\end{cases}
\]

such that
1. if $r > \rho^*(\beta)$, then we can adjust the Neyman-Pearson (NP) test, that is, the likelihood ratio test, to achieve
\[ P_0(\text{Type I Error}) + P_1(\text{Type II Error}) \to 0; \]
2. if $r \le \rho^*(\beta)$, then for any test,
\[ \liminf_{n} \; P_0(\text{Type I Error}) + P_1(\text{Type II Error}) \ge 1. \]
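Unfortunately, we generally cannot use the NP test since we do not know ε or µ. However,
Donoho and Jin [? ], based on Ingster [? ] and Jin [? ], show that Tukey’s higher-criticism
statistic, which does not require knowledge of ε or µ, asymptotically achieves the optimal
detection threshold, with
\[ p_i = \bar{\Phi}(X_i) = P(N(0, 1) > X_i), \quad \text{and} \quad \mathrm{HC}^*_n = \max_{\alpha \le \alpha_0} \frac{\sqrt{n} (\hat{F}_n(\alpha) - \alpha)}{\sqrt{\alpha(1 - \alpha)}}, \]
where $\bar{\Phi} = 1 - \Phi$ denotes the upper tail of the standard normal CDF.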
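
The one-sided p-values $P(N(0, 1) > X_i)$ can be computed with the complementary error function from the Python standard library. A minimal sketch, with an illustrative function name:

```python
import math


def upper_tail_pvalue(x):
    """P(N(0, 1) > x), computed as erfc(x / sqrt(2)) / 2."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


observations = [-0.5, 0.0, 1.959964, 3.5]
pvalues = [upper_tail_pvalue(x) for x in observations]
print(pvalues)
```

Using `erfc` directly, instead of subtracting a CDF value from 1, avoids cancellation for large positive observations, which are exactly the ones that produce the small p-values the higher-criticism statistic looks for.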

To better understand the results, Figure 3.3 visualizes the detection thresholds for NP
and Bonferroni.

Figure 3.3. Detection thresholds for NP and Bonferroni tests.

If the amplitude of the signal is above the solid black curve (achievable with NP), then
the NP test has full power, that is, we asymptotically separate. However, if it is below the
curve, we asymptotically merge, that is, every test is no better than flipping a coin. The
dashed black curve in Figure 3.3 shows the detection threshold for Bonferroni. Bonferroni’s
method achieves the optimal threshold for β ∈ [3/4, 1], corresponding to the sparsest setting,
but has suboptimal threshold if β ∈ [1/2, 3/4], which has less sparsity. This is seen in the
figure as the dashed Bonferroni curve and the solid NP curve align for β ≥ 3/4, but separate
below. Hence, in the area between the Bonferroni and the NP curve, the NP test has full
power, while Bonferroni is no better than a coin toss.
Bonferroni’s threshold for 1/2 ≤ β ≤ 1 is
\[ \rho_{\mathrm{Bon}}(\beta) = (1 - \sqrt{1 - \beta})^2. \]
Bonferroni is powerless if $r < \rho_{\mathrm{Bon}}(\beta)$. Bonferroni correctly detects non-nulls if the maximum
of non-nulls is greater than that of nulls, i.e., roughly
\[
\begin{aligned}
\max_{\text{non-null}} X_i \simeq \sqrt{2 r \log n} + \sqrt{2 \log n^{1-\beta}} > \sqrt{2 \log n}
&\iff \sqrt{r} + \sqrt{1 - \beta} > 1 \\
&\iff r > (1 - \sqrt{1 - \beta})^2 = \rho_{\mathrm{Bon}}(\beta).
\end{aligned}
\]
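
The gap between the two thresholds can be tabulated directly. The sketch below repeats the optimal boundary $\rho^*(\beta)$ from Section 3.4 so that it runs on its own; the function names are illustrative.

```python
import math


def rho_star(beta):
    """Optimal detection boundary for 1/2 < beta <= 1."""
    if beta < 0.75:
        return beta - 0.5
    return (1.0 - math.sqrt(1.0 - beta)) ** 2


def rho_bon(beta):
    """Bonferroni's detection threshold for 1/2 <= beta <= 1."""
    return (1.0 - math.sqrt(1.0 - beta)) ** 2


for beta in (0.55, 0.65, 0.75, 0.9):
    # The gap is positive below beta = 3/4 and zero from there on.
    print(beta, rho_star(beta), rho_bon(beta), rho_bon(beta) - rho_star(beta))
```

Writing $s = \sqrt{1 - \beta}$, the gap for 1/2 < β < 3/4 simplifies to $2(s - 1/2)^2$, which is strictly positive there and vanishes exactly at β = 3/4.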
By comparison, the higher-criticism test rejects when HC∗n is large, i.e., when the p-values
tend to be a bit too small. Next, we will consider “how small” the p-values should be, which
will be discussed in more detail in the next lecture.
For now, consider the empirical process
\[ W_n(t) = \frac{\sqrt{n} (\hat{F}_n(t) - t)}{\sqrt{t(1 - t)}}, \]
where $W_n(t)$ converges in distribution to N(0, 1) for each t. Empirical process theory tells
us that

1. $\{\sqrt{n} (\hat{F}_n(t) - t)\}_{0 \le t \le 1}$ converges in distribution to a Brownian bridge, and
2. $\max_{1/n \le t \le \alpha_0} W_n(t) / \sqrt{2 \log \log n} \xrightarrow{p} 1$.

This suggests the threshold $\sqrt{2 \log \log n}$ for the HC statistic.
Theorem 3 (Donoho and Jin [? ]). If we reject when $\mathrm{HC}^*_n \ge (1 + \varepsilon) \sqrt{2 \log \log n}$ then for
any alternative $H_1$ in which $r > \rho^*(\beta)$,
\[ P_0(\text{Type I Error}) + P_1(\text{Type II Error}) \to 0. \]
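
The asymptotic critical value grows extremely slowly in n, which a short computation makes visible. In this sketch the argument `slack` plays the role of the ε in Theorem 3; both that name and the function name are chosen for the example.

```python
import math


def hc_threshold(n, slack=0.0):
    """(1 + slack) * sqrt(2 * log(log(n))); needs n >= 3 so that log(log(n)) > 0."""
    if n < 3:
        raise ValueError("n must be at least 3")
    return (1.0 + slack) * math.sqrt(2.0 * math.log(math.log(n)))


for n in (10**4, 10**6, 10**9):
    print(n, hc_threshold(n))
```

Even at n = 10^9 the threshold stays below 3, so the asymptotic calibration separates null and alternative behavior only weakly at realistic sample sizes.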
In practice, however, the asymptotic approximation to the critical value of the higher-
criticism statistic may not be accurate in finite samples. Furthermore, in finite samples,
the behavior of the process Wn (t) may not be well approximated by a Brownian bridge
for small values of t; the process Wn (t) is heavy-tailed near 0. In particular, for n large
and t small, a Poi(nt) distribution provides a more accurate approximation to nF̂n (t) than a
suitably centered and scaled Gaussian. As a result, the tails of Wn (t) will be much heavier
for t close to zero than for t away from zero, and so the behavior of sup1/n≤t≤α0 Wn (t) will be
highly dependent on the behavior of Wn (t) for t close to zero, which we can see in Figure 3.4.
Therefore, critical values computed with simulation may be conservative and result in a test
that performs more similarly to Bonferroni’s method than to a test using Fisher’s statistic.
Variations of Tukey’s higher-criticism test that, in part, adjust for the heavy tails near 0
have been proposed and recently studied. One variation is the Berk-Jones statistic, which
standardizes the binomial counts using a log-likelihood ratio transformation rather than a
normal approximation.
Definition 5. The Berk-Jones statistic is given by
\[ \mathrm{BJ}^+_n = \max_{1 \le i \le n/2} n D(p_{(i)}, i/n), \]
where $D(p_0, p_1) = p_0 \log(p_0/p_1) + (1 - p_0) \log((1 - p_0)/(1 - p_1))$ gives the Kullback-Leibler
distance between Ber($p_0$) and Ber($p_1$) distributions.
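
A direct transcription of Definition 5 is sketched below. It follows the definition exactly as written in these notes, including the argument order $D(p_{(i)}, i/n)$ and the range 1 ≤ i ≤ n/2, and should be read as an illustration of the formula, not as a reference implementation of the published statistic. The function names are assumptions.

```python
import math


def bernoulli_kl(p0, p1):
    """D(p0, p1) for 0 <= p0 <= 1 and 0 < p1 < 1, reading 0 * log(0) as 0."""
    total = 0.0
    if p0 > 0.0:
        total += p0 * math.log(p0 / p1)
    if p0 < 1.0:
        total += (1.0 - p0) * math.log((1.0 - p0) / (1.0 - p1))
    return total


def berk_jones(pvalues):
    """max over 1 <= i <= n/2 of n * D(p_(i), i/n), as in Definition 5."""
    n = len(pvalues)
    if n < 2:
        raise ValueError("need at least two p-values")
    ordered = sorted(pvalues)
    # i <= n/2 keeps i/n at most 1/2, so the second argument of D is never 0 or 1.
    return max(n * bernoulli_kl(ordered[i - 1], i / n) for i in range(1, n // 2 + 1))


print(berk_jones([0.1, 0.9]))
```

Since D is zero when its two arguments coincide, the statistic is small when the lower half of the ordered p-values tracks the uniform quantiles i/n and grows as they depart from them.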


Figure 3.4. Finite sample behavior of Wn (t).

The Berk-Jones statistic also achieves the optimal detection boundary asymptotically,
and it produces subexponential tails under $H_0$, regardless of t. Walther ran simulations to
compare the performance of $\mathrm{HC}^*_n$ and $\mathrm{BJ}^+_n$ and plotted the results in the graphs shown in
Figure 3.5. The plots demonstrate that $\mathrm{HC}^*_n$ outperforms $\mathrm{BJ}^+_n$ when β ∈ [3/4, 1], the sparsest
setting, while $\mathrm{BJ}^+_n$ is better when β is small.

Figure 3.5. Power of $\mathrm{HC}^*_n$ (red) and $\mathrm{BJ}^+_n$ (black) as a function of the sparsity parameter β. The left plot
shows power for sample size $n = 10^4$, the right plot for $n = 10^6$. In the range β ∈ (1/2, 3/4), we see that
$\mathrm{HC}^*_n$ has low, but increasing power, while $\mathrm{BJ}^+_n$ has high, but decreasing power. Their power curves cross at
β = 3/4, and in the range β ∈ [3/4, 1], we see that $\mathrm{HC}^*_n$ outperforms $\mathrm{BJ}^+_n$.

Another variation, proposed by Walther [? ], reintroduces an element of integration. This
variation combines the power of $\mathrm{HC}^*_n$ when β ∈ [3/4, 1) by looking at the smallest p-value
(namely, by performing the likelihood ratio test over the interval $(0, p_{(1)}]$), and the power of
$\mathrm{BJ}^+_n$ when β ∈ (1/2, 3/4) by employing a likelihood ratio test on the interval $(0, n^{-4(\beta - 1/2)})$.
This gives the average likelihood ratio (ALR).


Figure 3.6. Power of ALR (green), $\mathrm{HC}^*_n$ (red), and $\mathrm{BJ}^+_n$ (black) as a function of the sparsity parameter β.
The left plot shows power for sample size $n = 10^4$, the right plot for $n = 10^6$.

Definition 6. The average likelihood ratio is given by
\[ \mathrm{ALR} = \sum_{i=1}^{n} w_i \, \mathrm{LR}_i, \]
where $\mathrm{LR}_i = \exp\left( n D(p_{(i)}, i/n) \right)$ and the weights are $w_i = \frac{1}{2 i \log(n/3)}$.

As with $\mathrm{HC}^*_n$ and $\mathrm{BJ}^+_n$, ALR also achieves the optimal detection threshold asymptotically.
Plotting the power of ALR against $\mathrm{HC}^*_n$ and $\mathrm{BJ}^+_n$ gives Figure 3.6. We see that it achieves
essentially the better of $\mathrm{HC}^*_n$ and $\mathrm{BJ}^+_n$ for any β.
