AEPSHEP Lecture 1
Practical Statistics for Particle Physicists
Nicolas Berger (LAPP Annecy)
Statistics are everywhere
“There are three kinds of lies: lies, damned lies, and statistics.” – Benjamin Disraeli
And physics?
“If your experiment needs statistics, you ought to have done a better experiment.” – E. Rutherford
Introduction
Statistical methods play a critical role in many areas of physics.
[Figure: a “5σ” result]
Introduction
New Physics? 3.9σ? 2.1σ?
[Figure: high-mass X→γγ search, JHEP 09 (2016) 1]
Introduction
Precision measurements are another window into BSM effects
→ How to compute (and interpret) measurement intervals?
→ How to model systematic uncertainties?
→ How to get the smallest achievable uncertainties?
[Figure: decays, detector response, reconstruction. Image credit: S. Höche, SLAC-PUB-16160]
Measurement Errors: Energy measurement
Example: measuring the energy of a photon (γ) in a calorimeter.
→ Perfect case: the energy deposit is fully contained in the calorimeter readout.
→ Real life: part of the energy escapes → measure the leakage behind the calorimeter.
[Figure: calorimeter readout with the γ energy deposit, perfect case vs. real life]
Quantum Randomness: H→ZZ*→4ℓ
[Figure: H→ZZ*→4ℓ measurement, Phys. Rev. D 91, 012006]
http://www.phdcomics.com/comics/archive.php?comicid=1489
Probability Distributions
Probability distribution: { Pᵢ } for i = 0, 1, 2, …
Properties:
• Pᵢ ≥ 0
• Σᵢ Pᵢ = 1
Continuous Variables: PDFs
Continuous variable: can consider per-bin probabilities pᵢ, i = 1 … n_bins.
[Figure: contours of P(x, y)]
PDF Properties: Mean
E(X) = ⟨X⟩ : mean of X, the expected outcome on average over many measurements
⟨X⟩ = Σᵢ xᵢ Pᵢ   (discrete)   or   ⟨X⟩ = ∫ x P(x) dx   (continuous)
Sample mean: x̄ = (1/n) Σᵢ xᵢ
→ Property of the sample
→ Approximates the PDF mean
PDF Properties: (Co)variance
Variance of X: Var(X) = ⟨ (X − ⟨X⟩)² ⟩
→ Average squared deviation from the mean
→ RMS(X) = √Var(X) = σ_X, the standard deviation
Can be approximated by the sample variance: σ̂² = (1/(n−1)) Σᵢ (xᵢ − x̄)²
Covariance of X and Y: Cov(X, Y) = ⟨ (X − ⟨X⟩)(Y − ⟨Y⟩) ⟩
→ Large if variations of X and Y are “synchronized”
Correlation coefficient: ρ = Cov(X, Y) / √( Var(X) Var(Y) ),   −1 ≤ ρ ≤ 1
[Figure: scatter plots with Cov(x, y) > 0 and Cov(x, y) < 0]
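A minimal numpy sketch (toy data invented here for illustration, not from the lecture) showing how the sample quantities above estimate the PDF mean, variance, covariance and correlation:

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Toy data with "synchronized" variations: y depends on x plus independent noise
x = rng.normal(loc=0.0, scale=1.0, size=10_000)
y = 0.8 * x + rng.normal(loc=0.0, scale=0.6, size=10_000)

x_bar = x.mean()                               # sample mean, approximates <X>
var_x = x.var(ddof=1)                          # sample variance, 1/(n-1) convention
cov_xy = np.cov(x, y, ddof=1)[0, 1]            # sample covariance Cov(X, Y)
rho = cov_xy / np.sqrt(var_x * y.var(ddof=1))  # correlation coefficient

print(f"mean = {x_bar:.3f}, var = {var_x:.3f}, cov = {cov_xy:.3f}, rho = {rho:.3f}")
print("numpy cross-check of rho:", np.corrcoef(x, y)[0, 1])
```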
“Linear” vs. “non-linear” correlations
For non-Gaussian cases, the correlation coefficient ρ is not the whole story:
[Figure: scatter plots with various linear and non-linear correlation patterns. Source: Wikipedia]
Gaussian PDF
Gaussian distribution:
G(x ; X₀, σ) = 1/(σ√(2π)) exp( −(x − X₀)² / (2σ²) )
→ Mean: X₀
→ Variance: σ² (⇒ RMS = σ)
Generalize to N dimensions:
G(x ; X₀, C) = 1/[(2π)ᴺ |C|]^(1/2) exp( −½ (x − X₀)ᵀ C⁻¹ (x − X₀) )
→ Mean: X₀
→ Covariance matrix:
C = [ Var(X₁)      Cov(X₁, X₂) ]   =   [ σ₁²     ρσ₁σ₂ ]
    [ Cov(X₂, X₁)  Var(X₂)     ]       [ ρσ₁σ₂   σ₂²   ]
Tilt angle of the error ellipse in the (x₁, x₂) plane: tan 2α = 2ρσ₁σ₂ / (σ₁² − σ₂²)
[Figure: 1-D Gaussian centred at X₀ with width σ; 2-D error ellipse with tilt angle α]
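As a hedged illustration (the parameter values below are made up, not the lecture's), scipy can evaluate and sample the 2-D Gaussian above from a covariance matrix built out of σ₁, σ₂ and ρ:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Illustrative parameters: sigma1, sigma2 and correlation rho
sigma1, sigma2, rho = 1.0, 2.0, 0.5
X0 = np.array([0.0, 0.0])
C = np.array([[sigma1**2,             rho * sigma1 * sigma2],
              [rho * sigma1 * sigma2, sigma2**2            ]])

g = multivariate_normal(mean=X0, cov=C)
print("PDF value at X0:", g.pdf(X0))               # maximum of the 2-D Gaussian

sample = g.rvs(size=100_000, random_state=0)
print("sample covariance:\n", np.cov(sample, rowvar=False))   # ~ C

# Tilt angle of the error ellipse: tan(2 alpha) = 2 rho s1 s2 / (s1^2 - s2^2)
alpha = 0.5 * np.arctan2(2 * rho * sigma1 * sigma2, sigma1**2 - sigma2**2)
print("ellipse tilt angle [deg]:", np.degrees(alpha))
```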
Central Limit Theorem
For an observable X with any(*) distribution, one has
x̄ = (1/n) Σᵢ₌₁ⁿ xᵢ  ∼  G( ⟨X⟩, σ_X/√n )   as n → ∞
(*) assuming σ_X < ∞ and other regularity conditions
What this means:
• The average of many measurements is always Gaussian, whatever the distribution for a single measurement
• The mean of the Gaussian is the average of the single measurements
• The RMS of the Gaussian decreases as 1/√n: smaller fluctuations when averaging over many measurements
Another version: Σᵢ₌₁ⁿ xᵢ ∼ G( n⟨X⟩, √n σ_X )   as n → ∞
Central Limit Theorem in action
x̄ = (1/n) Σᵢ₌₁ⁿ xᵢ
[Figure: distribution of x̄ for increasing n]
→ Distribution becomes Gaussian, although very non-Gaussian originally
→ Distribution becomes narrower as expected (as 1/√n)
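A small numpy demonstration of this behaviour (the exponential single-measurement distribution is an arbitrary, deliberately non-Gaussian choice):

```python
import numpy as np

rng = np.random.default_rng(seed=2)

# Single measurements follow an exponential distribution (mean 1, RMS 1): very non-Gaussian.
# Average n of them, repeat for many pseudo-experiments, and look at the spread of x_bar.
for n in (1, 5, 50):
    xbar = rng.exponential(scale=1.0, size=(100_000, n)).mean(axis=1)
    print(f"n = {n:3d}: mean(x_bar) = {xbar.mean():.3f}, RMS(x_bar) = {xbar.std():.3f}, "
          f"expected sigma_X/sqrt(n) = {1.0/np.sqrt(n):.3f}")
# The RMS shrinks as 1/sqrt(n); a histogram of x_bar becomes increasingly Gaussian.
```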
Gaussian Quantiles
Consider z = (x − X₀)/σ, the “pull” of x.
Cumulative distribution: Φ(z) = ∫₋∞^z G(u ; 0, 1) du
Two-sided tail probabilities:
  Z    P(|x − X₀| > Zσ)
  1    0.317
  2    0.045
  3    0.003
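These quantiles are easy to reproduce with scipy (a quick sketch; the Z = 5 row is added here only as a reference for the “5σ” convention mentioned earlier):

```python
from scipy.stats import norm

# Two-sided tail probability P(|x - X0| > Z sigma)
for Z in (1, 2, 3, 5):
    p = 2 * norm.sf(Z)                  # sf(Z) = 1 - Phi(Z), the one-sided tail
    print(f"Z = {Z}: P(|x - X0| > Z sigma) = {p:.2e}")

# Inverse direction: which Z corresponds to a given two-sided tail probability?
print("Z for p = 0.05:", norm.isf(0.05 / 2))
```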
Chi-squared
χ² = Σᵢ₌₁ⁿ zᵢ² = Σᵢ₌₁ⁿ ( (xᵢ − μᵢ)/σᵢ )², the sum of n squared independent Gaussian pulls.
The χ² distribution depends on n (the number of degrees of freedom).
Rule of thumb: χ²/n should be ≲ 1.
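The rule of thumb follows from the fact that a χ² with n degrees of freedom has mean n; a quick scipy check (n = 20 chosen arbitrarily for illustration):

```python
from scipy.stats import chi2

n = 20                                        # illustrative number of degrees of freedom
print("mean of chi2_n:", chi2.mean(df=n))     # = n, hence chi2/n ~ 1 on average
print("RMS of chi2_n :", chi2.std(df=n))      # = sqrt(2n)
print("P(chi2 > 30)  :", chi2.sf(30, df=n))   # tail probability (p-value) for chi2 = 30
```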
Histogram Chi-squared
Histogram χ² with respect to a reference shape:
• Assume an independent Gaussian distribution in each bin
• Degrees of freedom = (number of bins) − (number of fit parameters)
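A minimal sketch of such a histogram χ² and its p-value (the bin contents, reference shape and number of fitted parameters below are invented for illustration):

```python
import numpy as np
from scipy.stats import chi2

n_obs = np.array([12., 18., 25., 30., 22., 15., 9., 5.])   # observed bin contents
y_ref = np.array([10., 20., 27., 28., 21., 14., 10., 6.])  # reference (fitted) shape
sigma = np.sqrt(y_ref)                                     # Gaussian approx. of the per-bin errors

chi2_val = np.sum(((n_obs - y_ref) / sigma) ** 2)
ndof = len(n_obs) - 2                                      # e.g. 2 parameters fitted to the reference
p_value = chi2.sf(chi2_val, df=ndof)
print(f"chi2 = {chi2_val:.1f} for {ndof} dof -> chi2/ndof = {chi2_val/ndof:.2f}, p = {p_value:.2f}")
```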
Statistical Modeling
Example 1: Z counting (Phys. Lett. B 759 (2016) 601)
σ_fid = (n_data − N_bkg) / (C_fid · L)
with n_data = 35000 ± 187, N_bkg = 175 ± 8, C_fid = 0.552 ± 0.006, L = (81 ± 2) pb⁻¹.
“Single-bin counting”: the only data input is n_data.
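A quick numerical sketch of the cross-section formula using the numbers quoted above, with naive uncorrelated error propagation (an assumption made here purely for illustration; the paper's treatment is more careful):

```python
import numpy as np

# Inputs quoted above: (value, uncertainty)
n_data, dn_data = 35000., 187.
n_bkg,  dn_bkg  = 175., 8.
c_fid,  dc_fid  = 0.552, 0.006
lumi,   dlumi   = 81., 2.             # integrated luminosity in pb^-1

sigma_fid = (n_data - n_bkg) / (c_fid * lumi)   # fiducial cross section in pb

# Relative uncertainties added in quadrature (assumes uncorrelated inputs)
rel = np.sqrt((dn_data**2 + dn_bkg**2) / (n_data - n_bkg)**2
              + (dc_fid / c_fid)**2 + (dlumi / lumi)**2)
print(f"sigma_fid = {sigma_fid:.0f} +/- {sigma_fid * rel:.0f} pb")
```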
Example 2: ttH→bb arXiv:2111.06712
Example 3: unbinned modeling (ATLAS-CONF-2017-045)
P(n ; S, B) = e^−(S+B) (S + B)ⁿ / n!
S : # of events from the signal process
B : # of events from the bkg. process(es)
Multiple counting bins
Count in bins of a variable ⇒ histogram n₁ … n_N (N : number of bins).
P({nᵢ} ; S, B) = ∏ᵢ₌₁ᴺ e^−(S f_S,i + B f_B,i) (S f_S,i + B f_B,i)^nᵢ / nᵢ!
→ Poisson distribution in each bin
→ f_S,i , f_B,i : per-bin fractions (= shapes) of signal and background
→ HEP: generally good modeling from simulation, although some uncertainties need to be accounted for.
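A hedged sketch of evaluating this binned Poisson probability in Python (the shapes, yields and observed counts are invented illustrative values):

```python
import numpy as np
from scipy.stats import poisson

f_S = np.array([0.05, 0.15, 0.40, 0.30, 0.10])   # per-bin signal shape, sums to 1
f_B = np.array([0.30, 0.25, 0.20, 0.15, 0.10])   # per-bin background shape, sums to 1

def log_prob(n, S, B):
    """log P({n_i}; S, B) = sum_i log Pois(n_i; S*f_S,i + B*f_B,i)."""
    mu = S * f_S + B * f_B
    return poisson.logpmf(n, mu).sum()

n_obs = np.array([32, 30, 55, 42, 15])           # illustrative observed histogram
print("log P at (S=50, B=100):", log_prob(n_obs, 50., 100.))
print("log P at (S=0,  B=150):", log_prob(n_obs, 0., 150.))
```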
What a PDF is for
Model describes the distribution of the observable: P(data ; parameters)
⇒ Possible outcomes of the experiment, for given parameter values
Can draw random events according to the PDF: generate pseudo-data
Example: generate from P(n ; λ = 5) → 2, 5, 3, 7, 4, 9, …
Each entry = a separate “experiment”
[Figure: generated pseudo-data, including an unbinned example]
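A one-line numpy version of this pseudo-data generation (λ = 5 as in the example above; the seed is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(seed=3)

# Ten pseudo-experiments drawn from P(n; lambda = 5): each entry is a separate "experiment"
pseudo_data = rng.poisson(lam=5, size=10)
print(pseudo_data)        # counts fluctuating around 5, e.g. [2 5 3 7 4 9 ...]
```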
What a PDF is also for: Likelihood
Model describes the distribution of the observable: P(data ; parameters)
⇒ Possible outcomes of the experiment, for given parameter values
We want the other direction: use data to get information on parameters, e.g. estimate P(λ = ?) from an observed count.
Maximum Likelihood Estimator (MLE) μ̂ : μ̂ = arg max L(μ)
[Figure: P(n ; S) at the observed value n = 5; L(S ; n = 5) is maximal at Ŝ = 5, and low at S = 0.5 and S = 20]
→ MLE: the value of μ for which this data was most likely to occur
→ The MLE is a function of the data, itself an observable
→ No guarantee it is the true value (the data may be “unlikely”), but a sensible estimate
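A minimal sketch of finding the MLE numerically with scipy (here for the single-bin Poisson example with n = 5 discussed in the backup slides; MINUIT would be the tool in ROOT-based practice):

```python
from scipy.optimize import minimize_scalar
from scipy.stats import poisson

n_obs = 5                                           # observed count

def nll(S):
    """lambda(S) = -2 log L(S; n_obs) for a single Poisson bin."""
    return -2.0 * poisson.logpmf(n_obs, S)

res = minimize_scalar(nll, bounds=(1e-6, 50.0), method="bounded")
print("MLE S_hat =", res.x)                         # ~ 5: for this model S_hat = n
```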
Gaussian case
[Figure: data points with Gaussian uncertainties compared to the model prediction yᵢ(μ)]
−2 log Likelihood:
λ(μ) = −2 log L(μ) = Σᵢ₌₁^N_bins ( (nᵢ − yᵢ(μ)) / σᵢ )²
HEP practice:
● MINUIT (C++ library within ROOT, numerical gradient descent)
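Outside ROOT, the same minimization can be done with iminuit, the Python binding to MINUIT2. A rough sketch: the bin contents, background and signal shape below are invented, and yᵢ(μ) = μ·fᵢ + bᵢ is an assumed simple model, not the lecture's:

```python
import numpy as np
from iminuit import Minuit     # pip install iminuit

n_i   = np.array([28., 35., 52., 41., 30.])    # observed bin contents (illustrative)
b_i   = np.array([25., 30., 40., 35., 28.])    # expected background per bin
f_i   = np.array([0.1, 0.2, 0.4, 0.2, 0.1])    # signal shape per bin
sig_i = np.sqrt(n_i)                           # Gaussian per-bin uncertainties

def lam(mu):
    """-2 log L(mu) in the Gaussian case: a chi2 between data and y_i(mu) = mu*f_i + b_i."""
    y_i = mu * f_i + b_i
    return np.sum(((n_i - y_i) / sig_i) ** 2)

m = Minuit(lam, mu=0.0)
m.errordef = Minuit.LEAST_SQUARES              # errordef = 1 for a chi2-like cost function
m.migrad()                                     # numerical minimization, as MINUIT does in ROOT
print("mu_hat =", m.values["mu"], "+/-", m.errors["mu"])
```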
If you have a computer, please install anaconda before the start of the class.
This provides a consistent installation of python, JupyterLab, etc.
→ Alternatively, you can also install JupyterLab as a standalone package.
→ Another solution is to run the notebooks on the public jupyter servers at
mybinder.org. This will probably be slower but avoids a local install.
● Use the notebook links if you have a local install: save the notebook
locally and open it with your JupyterLab installation.
● Use the binder links to use public servers: the links will open the
notebooks in a remote server session in your browser.
Notebooks with solutions to the exercises will be posted after the lectures.
Please let me know in case of technical issues running the notebooks!
Extra Slides
Error Bars
Strictly speaking, the uncertainty is given by the model:
→ Bin central value ~ mean of the bin PDF
→ Bin uncertainty ~ RMS of the bin PDF
The data is just what it is, a simple observed point.
Rare Processes?
Why do we get Poisson distributions?
ATLAS:
• Event rate ~ 1 GHz
Unbinned extended likelihood (Gaussian signal peak, exponential background):
P({mᵢ}ᵢ₌₁…ₙ ; S, B) = e^−(S+B) (S+B)ⁿ / n! · ∏ᵢ₌₁ⁿ [ S/(S+B) G(mᵢ ; m_H, σ) + B/(S+B) α e^−α mᵢ ]
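A sketch of evaluating this extended unbinned log-likelihood in Python (the event masses and the values of m_H, σ and α below are illustrative placeholders, not the lecture's):

```python
import numpy as np

def log_L(masses, S, B, m_H=125.0, sigma=2.0, alpha=0.02):
    """log of  Pois(n; S+B) * prod_i [ S/(S+B) G(m_i; m_H, sigma) + B/(S+B) alpha exp(-alpha m_i) ]."""
    n = len(masses)
    gauss = np.exp(-0.5 * ((masses - m_H) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    expo = alpha * np.exp(-alpha * masses)
    log_pois = -(S + B) + n * np.log(S + B) - np.sum(np.log(np.arange(1, n + 1)))  # log n!
    return log_pois + np.sum(np.log((S * gauss + B * expo) / (S + B)))

masses = np.array([110.3, 124.8, 125.6, 131.2, 118.9])   # illustrative per-event masses
print("log L(S=2, B=3) =", log_L(masses, S=2.0, B=3.0))
```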
Poisson Example
Assume a Poisson distribution with B = 0: P(n ; S) = e^−S Sⁿ / n!
Say we observe n = 5 and want to infer information on the parameter S.
→ Try different values of S for a fixed data value n = 5
→ Varying parameter, fixed data: likelihood
L(S ; n=5) = e^−S S⁵ / 5!
[Figure: P(n ; S) curves read at the observed value n = 5: low likelihood at S = 0.5, high likelihood at S = 5, low likelihood again at S = 20]
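The three values read off the plot above can be reproduced directly (a quick scipy check):

```python
from scipy.stats import poisson

n_obs = 5
for S in (0.5, 5.0, 20.0):
    print(f"L(S = {S:4.1f}; n = 5) = {poisson.pmf(n_obs, S):.2e}")
# Low likelihood at S = 0.5 and S = 20, high likelihood at S = 5, as on the plot.
```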
MLEs in Shape Analyses
Binned shape analysis:
L(S ; nᵢ) = P(nᵢ ; S) = ∏ᵢ₌₁ᴺ Pois(nᵢ ; S fᵢ + Bᵢ)
Gaussian approximation:
λ_Gaus(S) = Σᵢ₌₁ᴺ −2 log G(nᵢ ; S fᵢ + Bᵢ, σᵢ) = Σᵢ₌₁ᴺ ( (nᵢ − (S fᵢ + Bᵢ)) / σᵢ )²   ← the χ² formula!
→ Gaussian MLE (min χ², i.e. min λ_Gaus) : best-fit value in a χ² (least-squares) fit
→ Poisson MLE (min λ_Pois) : best-fit value in a likelihood fit (in ROOT, fit option “L”)
In RooFit, λ_Pois ⇒ RooAbsPdf::fitTo(), λ_Gaus ⇒ RooAbsPdf::chi2FitTo().
(ATLAS-CONF-2017-045)
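For intuition, here is a small comparison of the two best-fit values outside ROOT/RooFit, using scipy (all shapes, backgrounds and counts are illustrative, and σᵢ is approximated by √nᵢ):

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import poisson

f_i = np.array([0.1, 0.2, 0.4, 0.2, 0.1])       # signal shape per bin
B_i = np.array([20., 25., 30., 25., 20.])       # background prediction per bin
n_i = np.array([21, 29, 45, 28, 22])            # observed counts (illustrative)

lam_pois = lambda S: -2.0 * poisson.logpmf(n_i, S * f_i + B_i).sum()   # Poisson -2 log L
lam_gaus = lambda S: np.sum((n_i - (S * f_i + B_i)) ** 2 / n_i)        # chi2 with sigma_i ~ sqrt(n_i)

for name, lam in (("Poisson MLE  ", lam_pois), ("Gaussian chi2", lam_gaus)):
    res = minimize_scalar(lam, bounds=(0.0, 200.0), method="bounded")
    print(f"{name}: S_hat = {res.x:.1f}")
```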
MLE Properties
• Asymptotically Gaussian: P(μ̂) ∝ exp( −(μ̂ − μ*)² / (2σ_μ̂²) ) for n → ∞, and unbiased: ⟨μ̂⟩ = μ* for n → ∞
• Asymptotically efficient: σ_μ̂ is the lowest possible value (in the limit n → ∞) among consistent estimators
  → The MLE captures all the available information in the data
• Also consistent: μ̂ converges to the true value for large n, μ̂ → μ* as n → ∞
• Log-likelihood: can also minimize λ = −2 log L
Fisher Information:
I(μ) = ⟨ (∂ log L(μ)/∂μ)² ⟩ = − ⟨ ∂² log L(μ)/∂μ² ⟩
Measures the amount of information available in the measurement of μ.
Gaussian case:
• Gaussian likelihood: I(μ) = 1/σ_Gauss² → smaller σ_Gauss ⇒ more information
• For a Gaussian estimator μ̃: P(μ̃) ∝ exp( −(μ̃ − μ*)² / (2σ_μ̃²) ); for the MLE, Var(μ̂) = σ_μ̂²
Cramér-Rao bound: Var(μ̃) ≥ 1/I(μ) for any estimator μ̃; in the Gaussian case, Var(μ̃) ≥ σ_Gauss².
Efficient estimators reach the bound: e.g. the MLE in the large dataset limit.
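A toy-MC check of the Cramér-Rao bound for the single-bin Poisson model, where the MLE is Ŝ = n and the Fisher information works out to I(S) = 1/S (the value S = 25 and the number of toys are arbitrary choices for this sketch):

```python
import numpy as np

rng = np.random.default_rng(seed=4)
S_true = 25.0

# For P(n; S) = Pois(n; S): d(log L)/dS = n/S - 1, so I(S) = <(n/S - 1)^2> = Var(n)/S^2 = 1/S.
# The MLE is S_hat = n, and its variance should saturate the bound 1/I(S) = S.
S_hat = rng.poisson(lam=S_true, size=200_000).astype(float)   # one toy experiment per entry
print("Var(S_hat) from toys   :", S_hat.var())
print("Cramer-Rao bound 1/I(S):", S_true)
```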
Some Examples
• High-mass X→γγ Search: JHEP 09 (2016) 1 (3.9σ)
• Higgs Discovery: Phys. Lett. B 716 (2012) 1-29