Week 9 Estimation
for
STATISTICS FOR DATA SCIENCE - 2
Contents
1 Statistical Problems in real life
1.1 Example I: Who is the best captain in the IPL?
1.2 Example 2: How many tigers are there in India?
1.3 Example 3: Was a remote-proctored exam successful?
4 Estimation error
4.1 Bias
4.2 Risk: Squared error
4.3 Variance
4.4 Bias-variance tradeoff
Chapter 9

1 Statistical Problems in real life
• Problem and planning: To understand the problem, we need to ask questions. This phase of the analysis is not particularly mathematical, analytical or probability oriented.
• Data availability: This phase of the analysis is very important; this is where the statistical nature of the analysis starts entering the picture. Here we start looking at what data is available.
• Analysis:
Generally, the analysis part is very well-defined in practice. You have a model, you fit the model, and then you ask some specific questions about estimating parameters or testing hypotheses within it.
• Conclusion and communication:
– Develop visualizations for communicating results. The most important part of the statistical procedure is the conclusion and the way you communicate it to others. The communication should be clear and backed up with some reasoning. We will see in a later section, with the help of an example, how important communication is.
So, this is how a typical data or statistical problem and its analysis work. There are many different phases, and in this course we are going to focus on the analysis phase. We will now see a couple of examples of this flavour, so that we get an idea of the big picture of how statistical analysis works.
1.1 Example I: Who is the best captain in the IPL?
• Problem and planning
– Here, we can ask questions like: what are the qualities of a good captain? Some people might say we can look at win-loss records; whoever wins the most matches is the best captain. Doing this is not really difficult.
– But we want the “best” captain, not the most winning captain. Some people may argue that under the best captain, players play at their best. In that case, we will have to look at how other players performed under the captaincy of a particular captain.
Domain knowledge is very important during the problem understanding and planning phase. So, we would want some experts who can give their inputs as part of the statistical decision process. People will start giving their opinions when such questions are asked, and those opinions should be backed by data. One obvious answer to this question could be Mahendra Singh Dhoni. But we do not want the answer to be driven too much by opinion; we want it to be based on data. That takes us to the next part, which is data.
• Data
The best possible data here is score sheets from matches. We have already seen how to collect and consolidate such data for the IPL. We can also check other tournaments. We can also do some sampling of experts and fans, which will require designing a clever sampling scheme.
• Analysis:
– We will check what happened in the matches when a particular person was the captain. We want to see if the captaincy played a role, as opposed to just the individual talent of a player or just natural conditions. We can formulate hypotheses along these lines. Once we formulate the question, we will have to think of a model in which captaincy would have affected the flow of the data in some sense. This is where knowledge of statistical procedures within a model will help you.
– Derive some metrics that measure captaincy in the IPL; for example, metrics that measure the influence of captaincy in a match. We need some domain knowledge as well as statistics, and we have to put them together to come up with a clear analysis.
• Conclusion and communication:
Develop visualizations for communicating results. The communication should be clean.
Think of a typical fan and then give some reasons for why your analysis is good and
what led you to make this kind of a statement about a particular captain in IPL.
• Statistical methods:
1.3 Example 3: Was a remote-proctored exam successful?
(Figure: a picture from the remote-proctored exam)
• Problem and planning
– How to assess the success of the exam? By the success of the exam, we mean the success of the exam administration process.
– Honor code, possible collaboration. To answer this question, we will use the exam honor code as a metric. If we know that there were no significant honor code violations, then at least the sanctity of the exam was not violated.
• Data
• Analysis: Check whether the histogram of the scores is similar to that of previous in-person exams. If it is very different, it suggests that the honor code may have been violated.
– Test the hypothesis that the “honor code” was violated. Look at all the data associated with a student and have a way of making a call on whether or not the honor code was violated.
– Estimate the number of violations.
– Detect violators or groups of collaborators.
• Conclusion and communication:
• An honor code violation is where someone is caught doing something which is not acceptable.
Can we say the remote-proctored exam was successful, at least from the honor code point of view? We will look at this scenario in two different ways:
• Communication II: Honor code violations were within 0.15% under remote proctoring. This seems a more truthful account of what actually happened.
You will find such communications in the press and on social media, so be doubly cautious. Do not let your thoughts be controlled by the news and the things that you see there. A truthful representation of what the data conveys is very difficult to find. Quite often the person who sends it has their own ideology and is communicating to you in a factually correct, but truthfully wrong, way. Today's generation is driven by social media, so this is a skill you have to pick up.
3 Introduction to parameter estimation
In this section, we will look closely at the analysis part within the probabilistic statistical setting. One such procedure or statistical analysis method is called parameter estimation. This will show up quite often within the realm of a bigger statistical problem. Given a statistical problem, at the end we will often have to find, using data, a parameter which is missing in the model. We have seen before how iid samples convey a lot about the underlying distribution; so, if a parameter of the distribution is missing, how do we find it? This is called a parameter estimation problem. We will see a few examples to understand the problem.
1. Setting
4. We observe a certain number of iid samples from the distribution and using the observed
samples, we are required to estimate the parameter.
n Observed Poisson fit
0-2 18 12.2
3 28 27.0
4 56 56.5
5 105 94.9
6 126 132.7
7 146 132.7
8 164 166.9
9 161 155.6
10 123 130.6
11 101 99.7
12 74 69.7
13 53 45.0
14 23 27.0
15 15 15.1
16 9 7.9
17+ 5 7.1
Table 1
– σ 2 : variance in the voltage or current.
• Now, we will take iid samples of the voltage or current. Suppose we have 10 measure-
ments: 1.07, 0.91, 0.88, 1.07, 1.15, 1.02, 0.99, 0.99, 1.08, 1.08
From here, we can try to find µ and σ².
At the end of this section, if somebody gives you iid samples coming from a distribution
with unknown parameters, you should be able to find the parameter. This is a parameter
estimation problem.
X1, . . . , Xn ∼ iid X
• The distribution of X may not be completely known. Suppose X has some distribution
described with parameters θ1 , θ2 , . . ., where θi ∈ R.
θ̂ : (X1 , . . . , Xn ) → R
1. Estimator 1: p̂1 = 1/2
2. Estimator 2: p̂2 = (X1 + X2)/2
3. Estimator 3: p̂3 = (X1 + . . . + Xn)/n
All three estimators are valid. An estimator is just some function from the samples to the real line. But the question to ask is: which among them is a good estimator?
• p̂1 = 1/2
Here, the estimator will always give us the value 1/2 regardless of the samples. This does not use the samples at all, so it does not seem to be a good estimator, but it is still a valid estimator.
• p̂2 = (X1 + X2)/2
This seems like a reasonably better estimator than the first one because it uses the first two samples, but still it is just the first two samples. We would want our estimator to use all the n samples given to us.
• p̂3 = (X1 + . . . + Xn)/n
This estimator gives more meaningful information; it is a very smart estimator in a sense which we will see later. Again, it is a valid estimator.
An estimator is a function from the samples to the real line, so we can have many such functions; an infinite number of estimators are possible. For example, 2X1 − X2 and X1 + 2X2 + 3X3 are also valid estimators. But validity does not imply goodness. So, how do we characterise a good estimator? Any estimator we come up with is a random variable with a distribution, and we would like that distribution to be concentrated around the true parameter. In the next section, we will see how to design such estimators and what metrics we should use for that.
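To make this concrete, here is a minimal Python sketch (Python, numpy and the particular values used below are illustrative assumptions, not part of the lecture) of estimators as functions from the samples to the real line:

```python
import numpy as np

# An estimator is just a function from the samples (X1, ..., Xn) to a real number.
def p_hat1(x):            # ignores the samples entirely
    return 0.5

def p_hat2(x):            # uses only the first two samples
    return (x[0] + x[1]) / 2

def p_hat3(x):            # uses all n samples: the sample mean
    return np.mean(x)

def weird_but_valid(x):   # 2*X1 - X2: also a valid estimator, just not a good one
    return 2 * x[0] - x[1]

# One sampling of n = 10 iid Bernoulli(p) values, with p = 0.4 as an illustrative truth
rng = np.random.default_rng(0)
x = rng.binomial(1, 0.4, size=10)
print(x, p_hat1(x), p_hat2(x), p_hat3(x), weird_but_valid(x))
```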
4 Estimation error
In the previous section, we saw the parameter estimation problem. What we are trying to estimate is a constant value, so it does not have a distribution; the estimator, however, does have a distribution, and we hope that a good estimator will have very little error, i.e., it will always predict values close to the actual value θ. So, it is important to study the error in the estimation process.
In this section, we will see what the error is, how we quantify it, how we control it, whether we can make the probability of large errors very small, and so on.
The error of an estimator is defined as Error = θ̂ − θ, where the error itself is a random variable. We expect that a good estimator will take values close to the actual θ, i.e., the error should be very close to 0.
Mathematically, P(|Error| > δ) should be small. So, we look at the distribution of the error and we want it to be such that the probability with which it takes very large values is very small. But how do we pick such a δ? How big should δ be? We can understand this using a few examples.
• Firstly, we will have to understand that the parameter θ will have a certain range. For
example, let X ∼ Bernoulli(p), where 0 ≤ p ≤ 1.
– Suppose we take δ to be 0.01 for Bernoulli(p). We may think that an error of 0.01
is very small, but what if p = 10−5 ? In that case, 0.01 is huge compared to what
p takes.
So, it is good to have the magnitude of the error characterised in terms of the parameter we are estimating. So, if we are estimating p, then the error should not be more than 10% of p. Mathematically, we can write
|Error| ≤ p/10
• Now, suppose X ∼ Normal(µ, σ²) and we want to estimate µ. Here µ can take any value; there is no restriction. In such cases, how do we think of the error? We can think of it as a fraction of what we are estimating. So, if the error is only about 10% of the quantity being estimated, then we can be reasonably sure that the error is small.
– Let µ = 1000, and if we estimate it as 1001, then we are good. But if µ = 0.1 and
we are estimating it as 1.1, then it is a huge error.
• Sampling 1: 1, 0, 0, 1, 0, 1, 1, 1, 0, 0
– p̂1 = 1/2 = 0.5
– p̂2 = (1 + 0)/2 = 0.5
– p̂3 = 5/10 = 0.5
• Sampling 2: 1, 0, 0, 1, 0, 1, 0, 1, 0, 0
– p̂1 = 1/2 = 0.5
– p̂2 = (1 + 0)/2 = 0.5
– p̂3 = 4/10 = 0.4
• Sampling 3: 1, 1, 0, 0, 0, 1, 0, 1, 0, 1
– p̂1 = 1/2 = 0.5
– p̂2 = (1 + 1)/2 = 1
– p̂3 = 5/10 = 0.5
Observations:
• p̂1 will not work for all values of p. Suppose p were 1; we would just get 1, 1, . . . , 1, and yet p̂1 would still say 0.5 even as we see different samples, giving a large error. Only if p is very close to 0.5 does it work.
• p̂2 tends to vary a lot with the samples. Sometimes it takes the value 1; if we take another sampling where the first two samples are 0, p̂2 takes the value 0. So, the variation in the estimator is very large; it goes from 0 to 1 to 0.5, jumping all over the place, not staying steady.
• p̂3 seems to be very promising, it stays steady with the different samples, the variation
is less.
When we compare p̂2 with p̂1 , p̂1 seems like a good estimator to have. p̂1 is not able to
adapt to different values of p easily, but at least it does not keep jumping around. On the
other hand, p̂2 may adapt to different values of p, but across different samplings, it seems to
give a wider variation.
So, given any estimation problem and a bunch of estimators, we should first try to simulate the samples and check whether each estimator varies too much, whether it holds steady, and whether it works for different values of the unknown parameter. The variation in the estimator value should not be too high. At the same time, it should not be just stuck at one value. A rough simulation of this kind is sketched below.
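Here is a rough Python/numpy sketch of such a simulation (the true value p = 0.4, the sample size n = 10 and the number of repetitions are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
p_true, n, reps = 0.4, 10, 10000

samples = rng.binomial(1, p_true, size=(reps, n))
p1 = np.full(reps, 0.5)              # p_hat1: always 1/2
p2 = samples[:, :2].mean(axis=1)     # p_hat2: mean of the first two samples
p3 = samples.mean(axis=1)            # p_hat3: mean of all n samples

for name, est in [("p_hat1", p1), ("p_hat2", p2), ("p_hat3", p3)]:
    print(name, "mean =", est.mean().round(3), "std =", est.std().round(3))
# p_hat1 never moves, p_hat2 jumps between 0, 0.5 and 1, p_hat3 stays close to p_true.
```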
For the above example, we now want to find the errors of the estimators, their distributions, and the probability that the absolute value of the error is greater than p/10.
• p̂1 = 1/2
Error = 1/2 − p
– If we plot |1/2 − p| and p/10 against p, |1/2 − p| is above p/10 most of the time; only between 5/11 and 5/9 is the absolute value of the error less than p/10. When the value of p is very close to 0.5, the absolute value of the error falls below p/10; for any other value it is always above.
– The error is constant; there is no randomness there. So, the probability that the absolute error is greater than p/10 is equal to 1 if p is less than 5/11 or p is greater than 5/9.
We can see that the absolute value of error being greater than p/10 is true for a
large range of p except for p around 0.5. So, estimator 1 is not so good.
• p̂2 = (X1 + X2)/2
Error = (X1 + X2)/2 − p
P(|Error| > p/10) = 1 if p < 5/11 or 5/9 < p < 10/11.
x1 x2 e P (Error = e)
0 0 −p (1 − p)2
0 1 1/2 − p p(1 − p)
1 0 1/2 − p p(1 − p)
1 1 1−p p2
Table 2
– If we plot the possible error values e against p/10, the error is above p/10 most of the time; only for p between 5/11 and 5/9, and for p greater than 10/11, can the absolute value of the error be less than p/10. So, the probability of the error being greater than 10% of p is equal to 1 for a large range of p even for Estimator 2.
• p̂3 = (X1 + . . . + Xn)/n
Error = (X1 + . . . + Xn)/n − p
– We are interested in finding the probability that the absolute value of error is
greater than p/10. Can we control this in the estimator? We will see that p̂3 gives
a very surprising result.
– From Chebyshev's inequality, we know that for any random variable X,
P(|X − E[X]| > δ) ≤ Var(X)/δ²
Since the error is a random variable, we can write
P(|Error − E[Error]| > δ) ≤ Var(Error)/δ²
– Put δ = p/10 in the above bound; we get
P(|Error − E[Error]| > p/10) ≤ Var(Error)/(p/10)²
E[Error] = E[(X1 + . . . + Xn)/n − p] = E[(X1 + . . . + Xn)/n] − p = p − p = 0
This turns out to be a desirable property of the estimator, because on average the estimator should give an error of 0.
Var(Error) = Var((X1 + . . . + Xn)/n − p)
= Var((X1 + . . . + Xn)/n)
= (1/n²) Var(X1 + . . . + Xn)
= np(1 − p)/n²
Therefore,
P(|Error| > p/10) ≤ 100(1 − p)/(np)
• For any fixed p, as n becomes larger and larger, 100(1 − p)/(np) goes to zero. So, Chebyshev's bound falls like 1/n.
Why is p̂3 a good estimator?
1. It is using all the samples and with the increase in the number of samples, the perfor-
mance of the estimator increases.
2. The accuracy of the estimator keeps improving as n increases. In the case of estimators p̂1 and p̂2, there is no n: the probability that the error exceeds p/10 was simply 1 in many cases. But in the case of p̂3, n appears in these probabilities in such a way that, for any value of the parameter, the probability goes to 0 as n increases.
Observations:
• Various estimators are usually possible. Every estimator will have an error and the
error will have a distribution. We expect its distribution to be around 0.
• Bounds on P (| Error |> δ) are interesting and capture useful properties of the estima-
tor.
Good design: the probability that the absolute value of the error is greater than δ is a very useful way of characterizing an estimator. For a good estimator, P(|Error| > δ) should fall with n.
P(|Error − E[Error]| > δ) ≤ Var(Error)/δ²
We saw in this section how estimators give rise to errors. The errors have a distribution, and with a larger and larger number of samples we can control the magnitude of the error. The probability with which the error becomes very large can be controlled through tools like Chebyshev's inequality. This gives us a good design principle. In the next section we will talk about good ways of designing estimators.
4.1 Bias
X1 , X2 , . . . , Xn ∼ i.i.d X, with parameter θ. Let θ̂ be an estimator of θ.
Definition: The bias of the estimator θ̂ for a parameter θ, denoted Bias(θ̂, θ), is defined as
Bias(θ̂, θ) = E[θ̂ − θ] = E[θ̂] − θ
For different samplings, we get different values of θ̂. We would expect θ̂ to be close to θ, so E[θ̂ − θ] gives the difference on average; we want to control the distribution of the error. We say the estimator is unbiased if E[θ̂ − θ] = 0. We want our estimator to be unbiased, or at least the bias should be very low. If, on average, the estimate is not close enough to θ, the estimator is not good.
4.2 Risk: Squared error
Definition: The (squared-error) risk of the estimator θ̂ for a parameter θ, denoted Risk(θ̂, θ), is defined as
Risk(θ̂, θ) = E[(θ̂ − θ)²]
The bias can be either positive or negative, which can make it a bit misleading, or it may not capture the entire picture. We may be more interested in the amount by which θ̂ differs from θ; this is where risk plays an important role: whether you deviate to the left or to the right, the penalty is the same, and deviations cannot cancel out to 0. For example, if we look only at E[θ̂ − θ], θ̂ can be way off to the right of θ for a long time, or way off to the left of θ for a long time, and on average we would still get 0. So, sometimes even an unbiased estimator can be bad, and risk is a better way to measure your estimator: it puts a penalty of (θ̂ − θ)². If the risk is small, θ̂ will, with high probability, take values close to θ.
Squared-error risk is the second moment of the error. Another term used for this is mean squared error (MSE).
4.3 Variance
We want our estimator to have small variance. We want an estimator that, given any set of samples, does not deviate too much; in short, we do not want θ̂ to vary too much. The variance of an estimator θ̂ for a parameter θ is defined as
Var(θ̂) = E[(θ̂ − E[θ̂])²]
Note: The variance of the error θ̂ − θ is the same as the variance of the estimator θ̂, i.e., Var(θ̂). The error is just a translated version of θ̂ (shifted by the constant θ), so the variance of the error turns out to be the same as the variance of the estimator.
Theorem: Bias-variance tradeoff: The risk of the estimator satisfies the following relation-
ship:
Risk(θ̂, θ) = Bias(θ̂, θ)2 + Var(θ̂)
Proof: Risk(θ̂, θ) = E[(θ̂ − θ)²] = (E[θ̂ − θ])² + Var(θ̂ − θ) = Bias(θ̂, θ)² + Var(θ̂), using E[Y²] = (E[Y])² + Var(Y) with Y = θ̂ − θ.
Using this relationship, we can see that if we want to keep the mean squared error small, we will have to keep both the bias and the variance small. We can also trade off between the two, accepting a little more bias for less variance or vice versa.
While designing an estimator, we should try to reduce the bias. If we reduce the bias, the risk tends to go down, but sometimes reducing the bias leads to an increase in the variance. If we just keep decreasing the bias, the variance may blow up, and if we just keep decreasing the variance, the bias may blow up. So, we have to balance the two, and this tradeoff is very useful to keep in mind.
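The bias, variance and risk can also be estimated by simulation. The sketch below (Python/numpy assumed; the values of p, n and the number of repetitions are illustrative) checks the relation Risk ≈ Bias² + Variance for the three Bernoulli estimators:

```python
import numpy as np

rng = np.random.default_rng(3)
p, n, reps = 0.3, 10, 200000

x = rng.binomial(1, p, size=(reps, n))
estimators = {
    "p_hat1": np.full(reps, 0.5),
    "p_hat2": x[:, :2].mean(axis=1),
    "p_hat3": x.mean(axis=1),
}
for name, est in estimators.items():
    bias = est.mean() - p            # empirical Bias = E[theta_hat] - theta
    var = est.var()                  # empirical Var(theta_hat)
    risk = np.mean((est - p) ** 2)   # empirical E[(theta_hat - theta)^2]
    print(f"{name}: bias={bias:.4f} var={var:.4f} risk={risk:.4f} bias^2+var={bias**2 + var:.4f}")
```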
Example: Let X1, . . . , Xn ∼ iid Bernoulli(p). Find the bias, variance and risk of the estimators p̂1, p̂2 and p̂3 defined earlier.
1. p̂1 = 1/2
Solution:
• Bias = E[p̂1] − p
E[p̂1] = E[1/2] = 1/2
Bias = 1/2 − p
• Variance(p̂1) = Var(1/2) = 0
• Risk(p̂1) = Bias² + Var(p̂1)
Therefore,
Risk = (1/2 − p)² + 0 = (1/2 − p)²
2. p̂2 = (X1 + X2)/2
Solution:
• Bias = E[p̂2] − p
E[p̂2] = E[(X1 + X2)/2] = (1/2) E[X1 + X2] = (E[X1] + E[X2])/2 = 2p/2 = p
Bias = p − p = 0
• Variance(p̂2) = Var((X1 + X2)/2) = (1/4) Var(X1 + X2) = (1/4)(2p(1 − p)) = p(1 − p)/2
• Risk(p̂2) = Bias² + Var(p̂2)
Therefore,
Risk = 0 + p(1 − p)/2 = p(1 − p)/2
3. p̂3 = (X1 + . . . + Xn)/n
Solution:
• Bias = E[p̂3] − p
E[p̂3] = E[(X1 + . . . + Xn)/n] = (1/n) E[X1 + . . . + Xn] = (E[X1] + . . . + E[Xn])/n = np/n = p
Bias = p − p = 0
• Variance(p̂3) = Var((X1 + . . . + Xn)/n) = (1/n²) Var(X1 + . . . + Xn) = (1/n²)(np(1 − p)) = p(1 − p)/n
• Risk(p̂3) = Bias² + Var(p̂3)
Therefore,
Risk = 0 + p(1 − p)/n = p(1 − p)/n
4. p̂ = (X1 + . . . + Xn + √n/2)/(n + √n)
Solution:
• Bias = E[p̂] − p
E[p̂] = E[(X1 + . . . + Xn + √n/2)/(n + √n)]
= (1/(n + √n)) E[X1 + . . . + Xn + √n/2]
= (1/(n + √n)) (E[X1] + . . . + E[Xn] + √n/2)
= (np + √n/2)/(n + √n)
Bias = (np + √n/2)/(n + √n) − p = (np + √n/2 − np − √n p)/(n + √n) = (√n/2 − √n p)/(n + √n)
• Variance(p̂) = Var((X1 + . . . + Xn + √n/2)/(n + √n)) = Var((X1 + . . . + Xn)/(n + √n))
Therefore,
Var(p̂) = (1/(n + √n)²) Var(X1 + . . . + Xn) = np(1 − p)/(n + √n)²
• Risk(p̂) = Bias² + Var(p̂) = [n(1/2 − p)² + np(1 − p)]/(n + √n)² = n[(1/2 − p)² + p(1 − p)]/(n + √n)² = (n/4)/(n + √n)²,
since (1/2 − p)² + p(1 − p) = 1/4. Note that this risk does not depend on p.
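Since the risk of p̂3 is p(1 − p)/n while the risk of this estimator is the constant (n/4)/(n + √n)², a short sketch (Python/numpy assumed, n = 100 an illustrative choice) can compare the two as functions of p:

```python
import numpy as np

n = 100
p = np.linspace(0.01, 0.99, 9)
risk_p3 = p * (1 - p) / n                      # risk of the sample mean
risk_shrunk = n / (4 * (n + np.sqrt(n)) ** 2)  # risk of (sum(X) + sqrt(n)/2) / (n + sqrt(n))
for pi, r3 in zip(p, risk_p3):
    print(f"p={pi:.2f}  risk(p_hat3)={r3:.5f}  risk(p_hat)={risk_shrunk:.5f}")
# The biased estimator trades a little bias for lower variance and wins for p near 1/2,
# while the unbiased sample mean wins for p near 0 or 1.
```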
5.1 Moments and parameters
Suppose we have a random variable X with some distribution fX(x), either a PDF or a PMF, with some unknowns which we call parameters θ1, θ2, . . .. Now, given the PMF or PDF, we can always compute the moments or expected values, which can be expressed as functions of the parameters.
For example,
1. X ∼ Bernoulli(p)
E[X] = p
2. X ∼ Poisson(λ)
E[X] = λ
3. X ∼ Exponential(λ)
E[X] = 1/λ
4. X ∼ Normal(µ, σ 2 )
E[X] = µ and E[X 2 ] = σ 2 + µ2
5. X ∼ Gamma(α, β)
E[X] = α/β and E[X²] = α²/β² + α/β²
6. X ∼ Binomial(n, p)
E[X] = np and E[X 2 ] = np(1 − p) + n2 p2
You can see in all the above examples that we can write the moments as a function of
parameters.
The k-th sample moment is defined as Mk = (X1^k + X2^k + . . . + Xn^k)/n; it is the average of the k-th power of the samples that we observe. Note that a sample moment is a random variable and has a distribution. It is not the same as the distribution moment: the distribution moment E[X^k] is a fixed constant, a function of the parameters. We expect Mk to take values around E[X^k]. Using the WLLN and CLT, we can argue that the sample moments, for larger and larger sample sizes, take values close to the corresponding distribution moments.
• Procedure
– Equate sample moments to expression for moments in terms of unknown param-
eters.
– Solve for the unknown parameters.
• One parameter θ usually needs one moment. This is because we expect the first moment to be a function of θ. In some cases, though, the first moment is a constant: for example, for Normal(0, σ²) the first moment is 0, which is not a function of the parameter σ². In that case we keep looking at higher moments until we get one that depends on the parameter; here, that would be the second moment.
– Sample moment: m1
– Distribution moment: E[X] = f (θ)
– Solve for θ from f (θ) = m1 in terms of m1
– θ̂: replace m1 by M1 in the above solution
• Two parameters θ1 , θ2 usually needs two moments
– Sample moments: m1 , m2
– Distribution moments: E[X] = f (θ1 , θ2 ), E[X 2 ] = g(θ1 , θ2 )
– Solve for θ1 , θ2 from f (θ1 , θ2 ) = m1 , g(θ1 , θ2 ) = m2 in terms of m1 , m2 .
– θ̂1 , θ̂2 : replace m1 by M1 and m2 by M2 in above solution.
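As an illustration of this procedure, here is a minimal Python/numpy sketch applying it to Normal(µ, σ²) using the ten measurements listed earlier; equating E[X] = µ and E[X²] = µ² + σ² to M1 and M2 gives µ̂ = M1 and σ̂² = M2 − M1²:

```python
import numpy as np

# The ten measurements from the notes, modelled as iid Normal(mu, sigma^2)
x = np.array([1.07, 0.91, 0.88, 1.07, 1.15, 1.02, 0.99, 0.99, 1.08, 1.08])

m1 = np.mean(x)        # first sample moment  M1
m2 = np.mean(x ** 2)   # second sample moment M2

# Equate E[X] = mu and E[X^2] = mu^2 + sigma^2 to M1 and M2, then solve.
mu_hat = m1
sigma2_hat = m2 - m1 ** 2

print("mu_hat =", round(mu_hat, 4), " sigma2_hat =", round(sigma2_hat, 6))
```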
Example: Let X1, . . . , Xn ∼ iid Bernoulli(p). Find the method of moments estimate of p.
Solution:
• E[X] = p, so equating E[X] to the first sample moment gives the estimator p̂ = M1 = (X1 + . . . + Xn)/n
Example 6: Let X1 , . . . , Xn ∼ iid Poisson(λ). Find the method of moments estimate of λ.
Solution:
• E[X] = λ, so equating to the first sample moment, λ̂ = M1 = (X1 + . . . + Xn)/n
Example: Let X1, . . . , Xn ∼ iid Normal(µ, σ²). Find the method of moments estimates of µ and σ².
Solution:
• E[X] = µ, E[X²] = µ² + σ². Equating to the first two sample moments, µ̂ = M1 and σ̂² = M2 − M1².
1. Bernoulli(p) samples 1, 0, 0, 1, 0, 1, 1, 1, 0, 0:
p̂ = (1 + 0 + 0 + 1 + 0 + 1 + 1 + 1 + 0 + 0)/10 = 5/10 = 0.5
2. Alpha particle emissions in 10 seconds: Poisson(λ)
Number of particles emitted per second = 0.8392
3. Normal(µ, σ 2 ) : 1.07, 0.91, 0.88, 1.07, 1.15, 1.02, 0.99, 0.99, 1.08, 1.08
It is not unreasonable to expect a method like the method of moments to work, and it is quite useful. Quite often we may not get a handle on the actual distribution itself; we may not know the distribution and have to guess at it. Instead of guessing, we can just use the moments, and the moments can be estimated quite reliably from the samples using the sample moments.
Consider, for example, X ∼ Normal(µ, σ²) with PDF
fX(x) = (1/(σ√(2π))) e^(−(x − µ)²/(2σ²))
We can also write fX(x) as fX(x, µ, σ), to bring out the fact that µ and σ play a role in the PDF of the normal distribution.
Samples are usually random variables, so for a particular sampling instance, we know
that those random variables take some actual values. We can denote the actual values
as x1 , x2 , . . . , xn . We denote the likelihood of that actual sampling as L(x1 , x2 , . . . , xn ).
Sometimes the arguments will be dropped and we will just say the likelihood function is L. It is defined as
L(x1, x2, . . . , xn) = ∏_{i=1}^{n} fX(xi, θ1, θ2, . . .)
The likelihood represents the probability of occurrence of a particular sample. Since the
samples are independent, we can multiply their likelihoods. Likelihood is a function of the
unknown parameters of the distribution.
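To make the definition concrete, here is a small Python/numpy sketch that evaluates the likelihood of the Bernoulli sampling instance used earlier, for a few illustrative values of p:

```python
import numpy as np

x = np.array([1, 0, 0, 1, 0, 1, 1, 1, 0, 0])   # a Bernoulli sampling instance from the notes

def likelihood(p, x):
    # L(x1, ..., xn) = product over samples of fX(xi, p) = p^xi * (1-p)^(1-xi)
    return np.prod(p ** x * (1 - p) ** (1 - x))

for p in [0.2, 0.4, 0.5, 0.6, 0.8]:
    print(f"p={p:.1f}  L={likelihood(p, x):.6e}")
# The likelihood is a function of the unknown parameter p; it is largest near p = 0.5 here.
```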
Example 9: Consider a sampling instance from Normal(µ, σ 2 ): 1.07, 0.91, 0.88, 1.07,
1.15, 1.02, 0.99, 0.99, 1.08, 1.08. Find the likelihood of the samples.
Solution:
Likelihood,
L = (1/(σ√(2π))) e^(−(1.07 − µ)²/(2σ²)) × · · · × (1/(σ√(2π))) e^(−(1.08 − µ)²/(2σ²))
= (1/(σ√(2π)))^10 e^(−[(1.07 − µ)² + . . . + (1.08 − µ)²]/(2σ²))
The question is: why maximize the likelihood? It is intuitive: the samples are coming from a distribution with unknown parameters, and we are calculating their probability (or density). Out of all possible parameter values, we pick the one that gives the maximum likelihood to that particular sample. This seems like a very natural way to define an estimator.
Therefore, we find the likelihood of a sampling instance and then find the parameter that maximizes the likelihood.
Example 10: X1 , . . . , Xn ∼ iid Bernoulli(p). Find the ML estimate of p.
Solution:
• Let w = x1 + . . . + xn be the number of ones in the samples.
• Likelihood function, L = p^w (1 − p)^(n − w)
• ML estimation:
Log likelihood, log L = w log p + (n − w) log(1 − p)
Differentiate log L w.r.t. p and equate it to 0:
w/p − (n − w)/(1 − p) = 0 =⇒ w(1 − p) = (n − w)p =⇒ p = w/n
Therefore, p̂ML = (X1 + . . . + Xn)/n
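A quick numerical check (Python/numpy assumed; the grid of p values is an illustrative choice) maximizes this log likelihood over a grid and compares the maximizer with w/n:

```python
import numpy as np

x = np.array([1, 0, 0, 1, 0, 1, 1, 1, 0, 0])   # sampling instance from the notes
n, w = len(x), x.sum()                          # w = number of ones

p_grid = np.linspace(0.001, 0.999, 999)
log_L = w * np.log(p_grid) + (n - w) * np.log(1 - p_grid)   # log likelihood from above

print("grid maximizer :", p_grid[np.argmax(log_L)])
print("closed form w/n:", w / n)
```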
Example: Let X1, . . . , Xn ∼ iid Poisson(λ). Find the ML estimate of λ.
Solution:
P(X = k) = λ^k e^(−λ)/k!
• Likelihood function, L = ∏_{i=1}^{n} λ^(xi) e^(−λ)/xi! = λ^(x1 + . . . + xn) e^(−nλ)/(x1! x2! . . . xn!)
• ML estimation:
Log likelihood, log L = (x1 + . . . + xn) log λ − nλ − log(x1! x2! . . . xn!)
Differentiate log L w.r.t. λ and equate it to 0:
(x1 + . . . + xn)/λ − n = 0
=⇒ λ = (x1 + . . . + xn)/n
Therefore, λ̂ML = (X1 + . . . + Xn)/n, the sample mean X̄.
Example: Let X1, . . . , Xn ∼ iid Normal(µ, σ²). Find the ML estimates of µ and σ².
Solution:
• Samples: x1, x2, . . . , xn.
fX(x) = (1/(σ√(2π))) e^(−(x − µ)²/(2σ²))
• Likelihood function, L = ∏_{i=1}^{n} (1/(σ√(2π))) e^(−(xi − µ)²/(2σ²)) = (1/(σ√(2π)))^n e^(−[(x1 − µ)² + . . . + (xn − µ)²]/(2σ²))
• ML estimation:
Log likelihood,
log L = n log(1/(σ√(2π))) − (1/(2σ²))[(x1 − µ)² + . . . + (xn − µ)²]
= −n log(σ√(2π)) − (1/(2σ²))[(x1 − µ)² + . . . + (xn − µ)²]
1. ML estimate of µ
µ* = arg max over µ of { −n log(σ√(2π)) − (1/(2σ²))[(x1 − µ)² + . . . + (xn − µ)²] }
Differentiate log L w.r.t. µ, keeping σ constant, and equate it to 0.
d/dµ [ −n log(σ√(2π)) − (1/(2σ²))((x1 − µ)² + . . . + (xn − µ)²) ] = 0
=⇒ 0 + (1/(2σ²)) · 2((x1 − µ) + . . . + (xn − µ)) = 0
=⇒ (1/σ²)((x1 − µ) + . . . + (xn − µ)) = 0
=⇒ (x1 + . . . + xn) − nµ = 0
=⇒ µ = (x1 + . . . + xn)/n = x̄
Therefore, µ̂ML = X̄ = (X1 + . . . + Xn)/n
2. ML estimate of σ²
σ²* = arg max over σ of { −n log(σ√(2π)) − (1/(2σ²))[(x1 − µ)² + . . . + (xn − µ)²] }
Differentiating log L w.r.t. σ and equating to 0 gives
−n/σ + (1/σ³)[(x1 − µ)² + . . . + (xn − µ)²] = 0
=⇒ σ² = [(x1 − µ)² + . . . + (xn − µ)²]/n
Substituting µ = µ̂ML = x̄, we get σ̂²ML = [(X1 − X̄)² + . . . + (Xn − X̄)²]/n
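Applying these formulas to the ten measurements listed earlier gives a quick computation (Python/numpy assumed); note that the ML estimate of σ² divides by n, not n − 1:

```python
import numpy as np

x = np.array([1.07, 0.91, 0.88, 1.07, 1.15, 1.02, 0.99, 0.99, 1.08, 1.08])
n = len(x)

mu_ml = x.mean()                        # mu_hat_ML = sample mean
sigma2_ml = np.mean((x - mu_ml) ** 2)   # sigma^2_hat_ML = (1/n) * sum (xi - mu_hat)^2

print("mu_ML =", round(mu_ml, 4), " sigma2_ML =", round(sigma2_ml, 6))
```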
6.3 Observations
• Maximum likelihood is a very popular method for deriving estimators.
• Theoretically and intuitively appealing: maximize the probability or likelihood of the
observed samples.
• Numerous questions
Example 14: Let X1 , X2 , . . . , Xn ∼ iid X, where the distribution of X is
x        1    2    3
fX(x)    p1   p2   p3
where p1 + p2 + p3 = 1, and 0 < pi < 1.
Find the method of moments estimate and maximum likelihood estimate of p1 , p2 and
p3 .
Solution:
– Likelihood function, L = p1^(w1) p2^(w2) (1 − p1 − p2)^(n − w1 − w2), where w1 and w2 are the numbers of samples equal to 1 and 2 respectively.
– Maximizing the log likelihood over p1 and p2 gives p̂1 = w1/n, p̂2 = w2/n, and hence p̂3 = 1 − p̂1 − p̂2 = w3/n, where w3 = n − w1 − w2 is the number of samples equal to 3.
Example: Let X1, . . . , Xn ∼ iid Uniform[0, θ]. Find the ML estimate of θ.
Solution:
Likelihood function, L = 1/θ^n, provided all the samples satisfy x1, . . . , xn ≤ θ (and L = 0 otherwise).
θ* = arg max over θ of 1/θ^n
In order to maximize L, we need to pick the least possible value of θ.
Since θ cannot be smaller than any of x1, . . . , xn, the least possible θ is max(x1, . . . , xn).
Therefore, θ̂ML = max(X1, . . . , Xn).
Example: Let X1, . . . , Xn ∼ iid X, where X is uniform on {1, 2, . . . , N}. Find the method of moments estimate of N.
Solution:
First sample moment, m1 = x̄
E[X] = 1 · (1/N) + 2 · (1/N) + . . . + N · (1/N) = (1/N)(1 + 2 + . . . + N) = (N + 1)/2
Now, m1 = (N + 1)/2 =⇒ N = 2m1 − 1
Therefore, the MME estimator of N is
N̂MME = 2(X1 + . . . + Xn)/n − 1
Example: Let X1, . . . , Xn ∼ iid Gamma(α, β), where the density of X is
fX(x) = (β^α/Γ(α)) x^(α−1) e^(−βx)
Find the method of moments estimate and maximum likelihood estimate of α and β.
Solution:
• Using method of moments
Now,
m1 = α/β =⇒ α = βm1    (5)
m2 = α/β² + α²/β²
= βm1/β² + (βm1)²/β²    [Using (5)]
= m1/β + m1²
β = m1/(m2 − m1²)    (6)
Now, α = βm1 = [m1/(m2 − m1²)] · m1 = m1²/(m2 − m1²)
Replacing m1, m2 by M1, M2 gives the estimates α̂ = M1²/(M2 − M1²) and β̂ = M1/(M2 − M1²).
– Likelihood function,
L = ∏_{i=1}^{n} (β^α/Γ(α)) xi^(α−1) e^(−βxi) = (β^α/Γ(α))^n (x1 x2 . . . xn)^(α−1) e^(−β(x1 + . . . + xn))
– Log likelihood,
log L = nα log β − n log Γ(α) + (α − 1)(log x1 + . . . + log xn) − β(x1 + . . . + xn)
There is no simple closed form for α̂ML from this expression, so in practice it is maximized numerically, for example as sketched below.
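One way to do this numerical maximization is sketched here (assuming Python with numpy and scipy, and some synthetic Gamma data, none of which are part of the notes); the method of moments estimates serve as a starting point for the optimizer:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln

rng = np.random.default_rng(4)
x = rng.gamma(shape=2.0, scale=1 / 3.0, size=500)   # synthetic Gamma(alpha=2, beta=3) data
n, s, sl = len(x), x.sum(), np.log(x).sum()

def neg_log_L(params):
    a, b = params
    if a <= 0 or b <= 0:
        return np.inf
    # log L = n*a*log(b) - n*log Gamma(a) + (a-1)*sum(log xi) - b*sum(xi)
    return -(n * a * np.log(b) - n * gammaln(a) + (a - 1) * sl - b * s)

# Method of moments estimates make a reasonable starting point.
m1, m2 = x.mean(), np.mean(x ** 2)
start = [m1 ** 2 / (m2 - m1 ** 2), m1 / (m2 - m1 ** 2)]

res = minimize(neg_log_L, start, method="Nelder-Mead")
print("alpha_ML, beta_ML =", res.x)
```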
Example: Let X1, . . . , Xn ∼ iid Binomial(N, p). Find the method of moments estimate and maximum likelihood estimate of N and p.
Solution:
• Using method of moments
Now,
m1 = Np    (7)
m2 = Np(1 − p) + N²p²
= m1(1 − p) + m1²    [Using (7)]
p = (m1² + (m1 − m2))/m1    (8)
Now, N = m1/p = m1²/(m1² + (m1 − m2))
– Likelihood function,
L = ∏_{i=1}^{n} C(N, xi) p^(xi) (1 − p)^(N − xi)
= C(N, x1) C(N, x2) . . . C(N, xn) p^(x1 + . . . + xn) (1 − p)^(nN − (x1 + . . . + xn))
where C(N, x) denotes the binomial coefficient “N choose x”.
– Log likelihood,
log L = log C(N, x1) + log C(N, x2) + . . . + log C(N, xn) + (x1 + . . . + xn) log p + (nN − (x1 + . . . + xn)) log(1 − p)