Week 9 Estimation

Practice Book

for
STATISTICS FOR DATA SCIENCE - 2
Contents
1 Statistical Problems in real life
  1.1 Example 1: Who is the best captain in the IPL?
  1.2 Example 2: How many tigers are there in India?
  1.3 Example 3: Was a remote-proctored exam successful?

2 The importance of communication

3 Introduction to parameter estimation
  3.1 Illustrative example 1: Bernoulli(p) trials
  3.2 Illustrative example 2: Emission of alpha particles
  3.3 Illustrative example 3: Noise in electronic circuits
  3.4 Parameter estimation

4 Estimation error
  4.1 Bias
  4.2 Risk: Squared error
  4.3 Variance
  4.4 Bias-variance tradeoff

5 Estimator design approach: Method of moments
  5.1 Moments and parameters
  5.2 Moments of samples
  5.3 Method of moments
  5.4 Method of moments estimation

6 Estimator design approach: Maximum Likelihood
  6.1 Likelihood of iid samples
  6.2 Maximum likelihood (ML) estimator
  6.3 Observations

7 Finding MME and ML estimators

Chapter 9

1 Statistical Problems in real life


Statistical problems in real life have many different dimensions. We will start the unit by looking at how statistical problems arise in real life and take a big-picture view of how statistical analysis works. We will ask some interesting questions for which statistical analysis and methods might be possible, and then discuss the various things involved in the process. To answer such problems in a data-driven way, we go through the following stages/phases:

• Problem and planning: To understand the problem, we need to ask questions. This phase of the analysis is not very mathematical, analytical, or probability-oriented.

• Data availability: This phase of the analysis is very important; this is where the statistical nature of the analysis starts entering the picture. We start looking at what data is available.

• Analysis:

– Study data: We start looking at descriptive statistics, understanding the data, histograms, scatter plots, dependencies.
– Find patterns and fit models or form hypotheses: Once we have a model, there are fixed statistical procedures for finding unknown parameters, testing hypotheses, and so on.

The analysis part is usually very well-defined in practice: you have a model, you fit the model, and then you ask specific questions about estimating parameters or testing hypotheses within it.

• Conclusion and communication:

– Develop visualizations for communicating results. The most important part of the statistical procedure is the conclusion and the way you communicate it to others. The communication should be clear and backed up with reasoning. We will see in a later section, with the help of an example, how important communication is.

So, this is how a typical data problem or statistical problem and analysis works. There are
so many different phases and in this course, we are going to focus on the analysis phase.
We will now see a couple of examples of this flavour, so that we get an idea of the big picture
on how statistical analysis works.

3
1.1 Example 1: Who is the best captain in the IPL?
• Problem and planning
– Here, we can ask questions like what are the qualities of a good captain. Some
people might say we can look at win or loss records. Whoever wins the most
number of matches will be considered as the best captain. Doing this is not really
difficult.
– But we want the “best” captain, not the most winning captain. Some people may
argue that under the best captain, players play to their best. In that case, we
will have to look at how other players were performing during the captaincy of a
particular captain.
Domain knowledge is very important during the problem understanding and planning phase, so we would want some experts who can give their inputs as part of the statistical decision process. People will start giving their opinions when such questions are asked, and those opinions should be backed by data. One obvious answer to this question could be Mahendra Singh Dhoni, but we do not want our answer to be driven mostly by opinion; we want it to be based on data. That takes us to the next part, which is data.
• Data
The best possible data here is score sheets from matches. We have already seen, for the IPL, how to collect and consolidate such data, and we can also check other tournaments. We can additionally survey experts and fans, which calls for a cleverly designed sample.
• Analysis:
– We will check what happened in the matches when a particular person was the
captain. We want to see if the captaincy played a role as opposed to just individual
talent of a player or just natural conditions.
We can have the following hypotheses:

Null hypothesis: Captaincy did not play a role in the match.

Alternative hypothesis: Captaincy played a role in the match.

Once we formulate the question, we will have to think of a model in which captaincy could have affected the data in some sense. This is where knowledge of statistical procedures within a model will help you.
– Derive some metrics that measure the effect of captaincy in the IPL, for example, metrics that measure the influence of captaincy on a match. We need to combine some domain knowledge with the statistics to come up with a clear analysis.

• Conclusion and communication:
Develop visualizations for communicating results. The communication should be clear. Think of a typical fan and give reasons for why your analysis is good and what led you to make this kind of statement about a particular captain in the IPL.

1.2 Example 2: How many tigers are there in India?


The number of tigers summarises the health of the environment in one single number. There is a statutory body for strengthening the tiger population called the National Tiger Conservation Authority (NTCA). It carries out many activities, one of the biggest being the tiger census (you can refer to ntca.gov.in to read about how they do the tiger census, their methodologies, and so on).

• Tiger census: Sampling over multiple phases/methods


Different ways of sampling:

– Survey by field forest staff


– Landscape characterisation using satellite and other data
– Intensive camera traps: This is a very recent and popular innovation in this field.
They install cameras in different places which sense movements and take pictures.

• Statistical methods:

– Find relationships between tiger population and various factors.


– Find a joint distribution likelihood model.
– Estimate number of tigers not camera-trapped.

1.3 Example 3: Was a remote-proctored exam successful?


Thousands of students took the remote-proctored exam. Was it successful? This is an important problem: you have to convince university authorities and also the students that the exam was indeed successful. How do you approach this problem statistically?

A picture from the remote-proctored exam

• Problem and planning:

– How do we assess the success of the exam? By the success of the exam, we mean the success of the exam administration process.
– Honor code, possible collaboration: to answer this question, we will use the exam honor code as a metric. If we know that there were no significant honor code violations, at least the sanctity of the exam was not violated.

• Data

– Scores in online exam


– Data from G-Meet and from proctors on how the invigilation went, what the camera angles of the students were, and how good their connections were.
– Scores in previous in-person exams, either in the same subject or in related subjects written by the student. This is important because we believe violating the honor code in in-person exams is difficult, so their sanctity is clearer.

• Analysis: Check whether the histogram of the scores is similar to those of previous in-person exams. If it is very different, we may suspect that the honor code was violated.

– Test the hypothesis that the honor code was violated: look at all the data associated with a student and have a way of making a call on whether or not the honor code was violated.
– Estimate number of violations.
– Detect violators or groups of collaborators

• Conclusion and communication:

– We will have to communicate to university authorities that we can rely on the marks and methods.
– We will also have to communicate to students so that they have confidence in the
fairness of the exam.

2 The importance of communication


Here is a simple example to illustrate how communication can completely change the perspective of what is being told. Suppose 1500 students wrote an exam in a course.

• In the in-person exam, one honor code violation occurred, i.e., someone was caught doing something unacceptable.

• In the remote-proctored exam, two honor code violations occurred.

Can we say the remote-proctored exam was successful at least from the honor code point of
view? We will look at this scenario in two different ways:

• Communication I: 100% increase in honor code violations in remote-proctored exams.


Honor code violations went from one to two. That is a 100 percent increase in honor
code violations in remote-proctored exams. A very factually correct statement, but
conveys a completely wrong idea about what happened in the exam.

• Communication II: Honor code violations were within 0.15% under remote proctoring. This seems a more truthful account of what actually happened.

You will find such communications in the press and on social media, so be doubly cautious. Do not let your thoughts be controlled by the news and things you see there. A truthful representation of what the data conveys is very difficult to find. Quite often the person who sends it has their own ideology and communicates it to you in a factually correct but truthfully misleading way. Today's generation is driven by social media, so this is a skill you have to pick up.

3 Introduction to parameter estimation
In this section, we will look closely at the analysis part within the probabilistic statistical
setting. One such procedure or statistical analysis method is called parameter estimation.
This shows up quite often within the realm of a bigger statistical problem: given a statistical problem, at some point we have to find a parameter that is missing in the model, using data. We have seen before how iid samples convey a lot about the underlying distribution, so if a parameter of the distribution is unknown, how do we find it? This is called a parameter estimation problem. We will see a few examples to understand the problem.

3.1 Illustrative example 1: Bernoulli(p) trials


We will start with a very basic example of Bernoulli trials, which we will keep using to illustrate new ideas.

1. Setting

• n Bernoulli(p) trials, where the parameter p is unknown and 0 ≤ p ≤ 1.


• One set of samples (n = 10) from Bernoulli(p) trials: 1, 0, 0, 1, 0, 1, 1, 1, 0, 0.
Note that this sample will vary if we repeat the same experiment.
• Now the question here is: can we estimate the value of the parameter p given the samples? If we increase the sample size to, say, 100 or 500, we will be able to say with more confidence what p will be.

The above is a simple parameter estimation problem.

2. The result of a Bernoulli trial is a random variable X taking values in {0, 1}.

3. Distribution of X: P(X = 0) = 1 − p, P(X = 1) = p, where p is the parameter of the distribution.

4. We observe a certain number of iid samples from the distribution and using the observed
samples, we are required to estimate the parameter.
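
To make the setting above concrete, here is a minimal simulation sketch in Python (not part of the original text); it assumes numpy is available, and the true value p = 0.6 is an arbitrary choice for illustration. It shows how the fraction of ones settles down around p as the sample size grows.

```python
# Minimal simulation sketch (not part of the original text): draw iid
# Bernoulli(p) samples and estimate p by the fraction of ones.
# The true value p = 0.6 is an arbitrary choice for illustration.
import numpy as np

rng = np.random.default_rng(0)
p_true = 0.6

for n in [10, 100, 500, 10_000]:
    samples = rng.binomial(1, p_true, size=n)  # n Bernoulli(p) trials
    p_hat = samples.mean()                     # fraction of ones in the sample
    print(f"n = {n:6d}  p_hat = {p_hat:.3f}")
```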

3.2 Illustrative example 2: Emission of alpha particles


Suppose we have a radioactive substance, and some particles come out of it, which we call alpha particles. We can model the number of particles emitted within a fixed time period.

• Number of particles N emitted in a 10-second interval.

• We can model this as a Poisson distribution, so P(N = n) = λ^n e^(−λ) / n!.

n Observed Poisson fit
0-2 18 12.2
3 28 27.0
4 56 56.5
5 105 94.9
6 126 132.7
7 146 132.7
8 164 166.9
9 161 155.6
10 123 130.6
11 101 99.7
12 74 69.7
13 53 45.0
14 23 27.0
15 15 15.1
16 9 7.9
17+ 5 7.1

Table 1

• λ is a parameter and it represents the average number of particles emitted.


λ here will vary depending on the radioactive substance, but the model remains the same. So, given any radioactive substance, we count the alpha particles using some counter, we get the data, and we find our parameter λ. Table 1 shows data from one such experiment.
From the table, we can observe the number of alpha particles over several intervals. Using this data, we now have to find λ. Here we have a Poisson distribution with an unknown parameter. Different samplings will give different data; this is how iid samples come into the picture.
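
As a hedged sketch (not from the text), the "Poisson fit" column of Table 1 can be reproduced approximately by computing expected bin counts under a fitted Poisson model; here λ̂ = 8.392 is the estimate obtained later in Section 5.4, the total number of intervals is taken as the sum of the observed counts, and scipy is assumed to be available. Small differences from the printed column may come from rounding.

```python
# Hedged sketch: expected counts per bin of Table 1 under Poisson(lam_hat),
# where lam_hat = 8.392 (from Section 5.4) and the total number of
# 10-second intervals is the sum of the observed counts.
from scipy.stats import poisson

observed = {  # bin label -> observed count, copied from Table 1
    "0-2": 18, "3": 28, "4": 56, "5": 105, "6": 126, "7": 146, "8": 164,
    "9": 161, "10": 123, "11": 101, "12": 74, "13": 53, "14": 23,
    "15": 15, "16": 9, "17+": 5,
}
total = sum(observed.values())   # 1207 intervals in all
lam_hat = 8.392

for label, obs in observed.items():
    if label == "0-2":
        prob = poisson.cdf(2, lam_hat)            # P(N <= 2)
    elif label == "17+":
        prob = poisson.sf(16, lam_hat)            # P(N >= 17)
    else:
        prob = poisson.pmf(int(label), lam_hat)   # P(N = n)
    print(f"{label:>4}: observed {obs:4d}, expected {total * prob:6.1f}")
```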

3.3 Illustrative example 3: Noise in electronic circuits


Suppose we have a circuit and we want to measure the voltage or current. Over a period of time, we would expect the voltage or current to be the same. But if we keep measuring it with a very sensitive measurement instrument, we will see some random fluctuations over time. This happens because of various noise processes involving electrons and other things in the circuit.
• A popular model for such voltages or currents in circuits is Normal(µ, σ 2 ).

• It has two parameters:

– µ : average voltage or current

– σ 2 : variance in the voltage or current.

• Now, we will take iid samples of the voltage or current. Suppose we have 10 measurements: 1.07, 0.91, 0.88, 1.07, 1.15, 1.02, 0.99, 0.99, 1.08, 1.08.
From here we can try to find µ and σ².

At the end of this section, if somebody gives you iid samples coming from a distribution
with unknown parameters, you should be able to find the parameter. This is a parameter
estimation problem.

3.4 Parameter estimation


• We have n iid samples from a distribution X, i.e.,

X1, ..., Xn ∼ iid X

• The distribution of X may not be completely known. Suppose X has some distribution
described with parameters θ1 , θ2 , . . ., where θi ∈ R.

• Parameter estimation problems: What is θ1 ?, what is θ2 ?, . . ..

• Estimator for a parameter θ: a function of the samples X1, ..., Xn. We denote the estimator for the parameter θ as θ̂(X1, ..., Xn). Mathematically,

θ̂ : (X1, ..., Xn) → R

We call the output the estimate of the parameter θ.

• Parameter vs Estimator: θ is a constant parameter, not a random variable, while the estimator θ̂ is a random variable. So, θ̂ will have a distribution. Different samplings will give different values of θ̂ depending on the actual realization of the samples.

We expect θ̂ to take values around θ. We want to design an estimator θ̂ in such a way that its distribution is concentrated around θ. This is how estimation works. In the coming sections, we will see how to come up with estimators, how to characterize a good estimator, how to design a good estimator, and so on.

Example 1: X1 , X2 , . . . , Xn ∼ iid Bernoulli(p).

Here, the parameter p is unknown.

We have three different estimators for p.


1. Estimator 1: p̂1 = 1/2

2. Estimator 2: p̂2 = (X1 + X2)/2

3. Estimator 3: p̂3 = (X1 + ... + Xn)/n
All three estimators are valid. An estimator is just some function from the samples to the real line. But the question to ask is which among them is a good estimator.

• p̂1 = 1/2
  This estimator will always give us the value 1/2 regardless of the samples. It does not use the samples at all, so it does not seem to be a good estimator, but it is still a valid estimator.

• p̂2 = (X1 + X2)/2
  This seems like a reasonably better estimator than the first one because it uses the first two samples, but it uses only the first two samples. We would want our estimator to use all the n samples given to us.

• p̂3 = (X1 + ... + Xn)/n
  This estimator gives more meaningful information; it is a very smart estimator in a sense that we will see later. Again, it is a valid estimator.

An estimator is a function from the samples to the real line, so we can have many functions; an infinite number of estimators are possible. For example, 2X1 − X2 and X1 + 2X2 + 3X3 are also valid estimators. But validity does not imply goodness. So, how do we characterise a good estimator? We want any estimator we come up with to have a distribution concentrated around the true parameter value. In the next section, we will see how to design such estimators and what metric we should use for that.
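
The following sketch (not from the text) repeats the sampling experiment a few times and prints the three estimators; the true p = 0.4, the seed, and the number of repetitions are arbitrary choices for illustration.

```python
# Hedged sketch: compare the three estimators across repeated samplings
# from Bernoulli(p). The true p = 0.4, the seed and the number of
# repetitions are arbitrary choices for illustration.
import numpy as np

rng = np.random.default_rng(1)
p_true, n = 0.4, 10

for r in range(5):
    x = rng.binomial(1, p_true, size=n)
    p1 = 0.5                  # estimator 1: ignores the data
    p2 = (x[0] + x[1]) / 2    # estimator 2: uses only the first two samples
    p3 = x.mean()             # estimator 3: uses all n samples
    print(f"sampling {r + 1}: p1 = {p1}, p2 = {p2}, p3 = {p3:.1f}")
```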

4 Estimation error
In the previous section, we saw the parameter estimation problem. What we are trying to estimate is a constant value, so it does not have a distribution; the estimator has a distribution, and we hope that a good estimator will have very little error, i.e., it will predict values close to the actual value θ. So, it is important to study the error in the estimation process.
In this section, we will see what the error is, how we quantify it, how we control it, whether we can make the probability of large errors very low, and so on.

Let X1 , X2 , . . . , Xn ∼ iid X with parameter θ. Let θ̂(X1 , . . . , Xn ) be an estimator of θ.


Then, we define the error as the difference between the estimator and the parameter, i.e.,

Error = θ̂(X1, ..., Xn) − θ,

where the error itself is a random variable. We expect that a good estimator will take values close to the actual θ, i.e., the error should be very close to 0.
Mathematically, P(|Error| > δ) should be small. So, we look at the distribution of the error and want the probability with which it takes very large values to be small. But how do we pick such a δ? How big should δ be? We can understand this using a few examples.

• Firstly, we will have to understand that the parameter θ will have a certain range. For
example, let X ∼ Bernoulli(p), where 0 ≤ p ≤ 1.

– Suppose we take δ to be 0.01 for Bernoulli(p). We may think that an error of 0.01 is very small, but what if p = 10^(−5)? In that case, 0.01 is huge compared to the values p takes.

So, it is good to have the magnitude of the error characterised in terms of the parameter we are estimating. If we are estimating p, then the error should not be more than, say, 10% of p. Mathematically, we can write

|Error| ≤ p/10

• Now, suppose X ∼ Normal(µ, σ²) and we want to estimate µ. Here µ can take any value; there is no restriction. In such cases, how do we think of the error? We can think of it as a fraction of what we are estimating: if the error is only about 10% of what you are estimating, then you can be reasonably sure the error is small.

– Let µ = 1000; if we estimate it as 1001, then we are good. But if µ = 0.1 and we estimate it as 1.1, then it is a huge error.

Example 2: Let X1 , . . . , Xn ∼ iid Bernoulli(p). Consider three different estimators of p.


1. p̂1 = 1/2

2. p̂2 = (X1 + X2)/2

3. p̂3 = (X1 + ... + Xn)/n
Now, consider 3 different samplings of size 10 from Bernoulli(p).

• Sampling 1: 1, 0, 0, 1, 0, 1, 1, 1, 0, 0
  – p̂1 = 1/2 = 0.5
  – p̂2 = (1 + 0)/2 = 0.5
  – p̂3 = 5/10 = 0.5

• Sampling 2: 1, 0, 0, 1, 0, 1, 0, 1, 0, 0
  – p̂1 = 1/2 = 0.5
  – p̂2 = (1 + 0)/2 = 0.5
  – p̂3 = 4/10 = 0.4

• Sampling 3: 1, 1, 0, 0, 0, 1, 0, 1, 0, 1
  – p̂1 = 1/2 = 0.5
  – p̂2 = (1 + 1)/2 = 1
  – p̂3 = 5/10 = 0.5
Observations:
• p̂1 will not work for all values of p. Suppose p were 1; we would just get 1, 1, ..., 1 and yet still say p̂1 = 0.5, which gives an error. It works only if p is very close to 0.5.
• p̂2 tends to vary a lot with the samples. Sometimes it takes the value 1; if we take another sampling where the first two samples are 0, p̂2 takes the value 0. So, the variation in the estimator is very large: it goes from 0 to 1 to 0.5, jumping all over the place, not staying steady.
• p̂3 seems very promising: it stays steady across the different samplings, and the variation is less.
When we compare p̂2 with p̂1, p̂1 seems like a good estimator to have: it is not able to adapt to different values of p, but at least it does not keep jumping around. On the other hand, p̂2 may adapt to different values of p, but across different samplings it seems to show a wider variation.
So, given any estimation problem and a bunch of estimators, we should first try to simulate the samples and check whether each estimator varies too much, whether it holds steady, and whether it works for different values of the unknown parameter. The variation in the estimator value should not be too high; at the same time, it should not be stuck at one value.

For the above example, we want to find the errors of the estimators, their distributions, and the probability that the absolute value of the error is greater than p/10.

• p̂1 = 1/2

  Error = 1/2 − p

  P(|Error| > p/10) = 1 if p < 5/11 or p > 5/9

Error for p̂1

– In the above figure, you can see the plot of |1/2 − p| and p/10. |1/2 − p| is above p/10 most of the time; only between 5/11 and 5/9 is the absolute value of the error less than p/10. When p is very close to 0.5, the absolute value of the error falls below p/10; for any other value it stays above.
– The error is constant; there is no randomness here. So, the probability that the absolute value of the error is greater than p/10 equals 1 if p is less than 5/11 or greater than 5/9, i.e., for a large range of p except around 0.5. So, estimator 1 is not so good.

• p̂2 = (X1 + X2)/2

  Error = (X1 + X2)/2 − p
P (| Error |> p/10) = 1 if p < 5/11 or 5/9 < p < 10/11.

x1   x2   e         P(Error = e)
0    0    −p        (1 − p)²
0    1    1/2 − p   p(1 − p)
1    0    1/2 − p   p(1 − p)
1    1    1 − p     p²

Table 2

Error for p̂2

– The figure above plots the possible error values against p/10. For p < 5/11 and for 5/9 < p < 10/11, every value the error can take has absolute value greater than p/10, so P(|Error| > p/10) = 1. Thus the probability of the error exceeding 10% of p is equal to 1 for a large range of p even for estimator 2.

• p̂3 = (X1 + ... + Xn)/n

  Error = (X1 + ... + Xn)/n − p

– We are interested in finding the probability that the absolute value of error is
greater than p/10. Can we control this in the estimator? We will see that p̂3 gives
a very surprising result.

– From Chebyshev's inequality we know that for any random variable X,

  P(|X − E[X]| > δ) ≤ Var(X)/δ²

  Since the error is a random variable, we can write

  P(|Error − E[Error]| > δ) ≤ Var(Error)/δ²

– Put δ = p/10 in the above bound to get

  P(|Error − E[Error]| > p/10) ≤ Var(Error)/(p/10)²

  E[Error] = E[(X1 + ... + Xn)/n − p] = E[(X1 + ... + Xn)/n] − p = p − p = 0

  This turns out to be a desirable property of the estimator, because on average the estimator should give an error of 0.

  Var(Error) = Var((X1 + ... + Xn)/n − p)
             = Var((X1 + ... + Xn)/n)
             = (1/n²) Var(X1 + ... + Xn)
             = np(1 − p)/n² = p(1 − p)/n

  Therefore,

  P(|Error| > p/10) ≤ 100(1 − p)/(np)

• For any fixed p, as n becomes larger and larger, 100(1 − p)/(np) goes to zero. So, the Chebyshev bound falls like 1/n.
Why is p̂3 a good estimator?

1. It uses all the samples, and as the number of samples increases, the performance of the estimator improves.

2. The accuracy of the estimator keeps improving as n increases. In the case of estimators p̂1 and p̂2, n does not appear at all: the probability that the error exceeds p/10 was simply 1 in some cases. But in the case of p̂3, n appears in the bound in such a way that, for any value of the parameter, the probability goes to 0 as n increases.
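
Here is a hedged Monte Carlo sketch (not from the text) that estimates P(|p̂3 − p| > p/10) empirically and compares it with the Chebyshev bound 100(1 − p)/(np); the values of p, n, and the number of repetitions are arbitrary choices. It shows both that the probability falls with n and that the Chebyshev bound is loose.

```python
# Hedged sketch: Monte Carlo estimate of P(|p_hat3 - p| > p/10) versus the
# Chebyshev bound 100(1 - p)/(n p). p = 0.3 and the repetition count are
# arbitrary choices for illustration.
import numpy as np

rng = np.random.default_rng(2)
p, reps = 0.3, 200_000

for n in [50, 500, 5000]:
    p_hat = rng.binomial(n, p, size=reps) / n      # p_hat3 for each repetition
    prob = np.mean(np.abs(p_hat - p) > p / 10)     # Monte Carlo estimate
    bound = 100 * (1 - p) / (n * p)                # Chebyshev bound (may exceed 1)
    print(f"n = {n:5d}  estimated prob = {prob:.4f}  Chebyshev bound = {bound:.3f}")
```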

Observations:

• Various estimators are usually possible. Every estimator will have an error and the
error will have a distribution. We expect its distribution to be around 0.

• Bounds on P(|Error| > δ) are interesting and capture useful properties of the estimator.
  Good design: the probability that the absolute value of the error is greater than δ is a very useful way of characterizing the estimator; P(|Error| > δ) should fall with n.

• Chebyshev bound is a very useful tool.

P(|Error − E[Error]| > δ) ≤ Var(Error)/δ²

• Good design principles: E[Error] should be close to or equal to 0. Also, Var(Error) should tend to 0 as n grows.

We saw in this section how estimators give rise to errors. The errors have a distribution, and with a larger and larger number of samples we can control the magnitude of the error. The probability with which the error becomes very large can be controlled through tools like the Chebyshev inequality. This gives us a good design principle. In the coming sections we will talk about good ways of designing estimators.

4.1 Bias
X1 , X2 , . . . , Xn ∼ i.i.d X, with parameter θ. Let θ̂ be an estimator of θ.

Definition: The bias of the estimator θ̂ for a parameter θ, denoted Bias(θ, θ̂), is defined as

Bias(θ, θ̂) = E[θ̂ − θ] = E[θ̂] − θ

For different samplings, we get different values of θ̂. We would expect θ̂ to be close to θ, so E[θ̂ − θ] gives the difference on average. We want to control the distribution of the error. We say the estimator is unbiased if E[θ̂ − θ] = 0. We want our estimator to be unbiased, or at least the bias should be very low. If on average the error is not small enough, the estimator is not good.

4.2 Risk: Squared error
Definition: The (squared-error) risk of the estimator θ̂ for a parameter θ, denoted Risk(θ, θ̂)
is defined as
Risk(θ, θ̂) = E[(θ̂ − θ)²]

The bias can be either positive or negative, which can make it a bit misleading; it may not capture the entire picture. We are often more interested in the magnitude by which θ̂ differs from θ, and this is where risk plays an important role: whether you deviate to the left or to the right, the penalty is the same and cannot cancel to 0. For example, if we look at E[θ̂ − θ], θ̂ can be way off to the right of θ in some samplings and way off to the left in others, and on average we will get 0. So, even an unbiased estimator can be bad, and risk is a better way to measure an estimator: it puts a penalty of (θ̂ − θ)². If the risk is small, θ̂ will take values close to θ with high probability.
Squared-error risk is the second moment of the error. Another terminology used for this is mean squared error (MSE).

4.3 Variance
We want our estimator to have less variance. We want an estimator such that given any
set of samples, it does not deviate too much. In short, we don’t want θ̂ to vary too much.
Variance of an estimator θ̂ for a parameter θ is defined as

Var(θ̂) = E[(θ̂ − E[θ̂])²]

Note: The variance of the error θ̂ − θ is the same as the variance of the estimator θ̂, i.e., Var(θ̂). The error is just a translated version of θ̂, so the variance of the error turns out to be the same as the variance of the estimator.

4.4 Bias-variance tradeoff


We saw three different terms for defining the characteristics of an estimator, but there is an
important relationship between the three of them, which is called the Bias-variance tradeoff.

Theorem (Bias-variance tradeoff): The risk of the estimator satisfies the following relationship:

Risk(θ̂, θ) = Bias(θ̂, θ)² + Var(θ̂)

Proof: For any random variable Y, E[Y²] = (E[Y])² + Var(Y). Taking Y = θ̂ − θ, we get Risk(θ̂, θ) = E[(θ̂ − θ)²] = (E[θ̂ − θ])² + Var(θ̂ − θ) = Bias(θ̂, θ)² + Var(θ̂).

Using this relationship, we can see that if we want to keep the mean squared error small, we have to keep both the bias and the variance small. We can also trade off between the two: we can decrease the bias and accept an increase in the variance, or vice versa, to balance them out.
While designing an estimator, we should try to reduce the bias. If we reduce the bias, the risk goes down, but sometimes reducing the bias leads to an increase in the variance. If we just keep decreasing the bias, the variance may blow up, and if we just keep decreasing the variance, the bias may blow up. So, we have to balance between these two, and this tradeoff is very useful.

Example 3: Let X1, X2, ..., Xn ∼ iid Bernoulli(p). Consider three different estimators for p.
1. p̂1 = 1/2

Solution:

• Bias = E[p̂1] − p

  E[p̂1] = E[1/2] = 1/2
  Bias = 1/2 − p

• Variance(p̂1) = Var(1/2) = 0

• Risk(p̂1) = Bias² + Var(p̂1)
  Therefore,

  Risk = (1/2 − p)² + 0 = (1/2 − p)²

2. p̂2 = (X1 + X2)/2

Solution:

• Bias = E[p̂2] − p

  E[p̂2] = E[(X1 + X2)/2] = (1/2) E[X1 + X2] = (E[X1] + E[X2])/2 = 2p/2 = p
  Bias = p − p = 0

• Variance(p̂2) = Var((X1 + X2)/2) = (1/4) Var(X1 + X2) = (1/4)(2p(1 − p)) = p(1 − p)/2

• Risk(p̂2) = Bias² + Var(p̂2)
  Therefore,

  Risk = 0 + p(1 − p)/2 = p(1 − p)/2
3. p̂3 = (X1 + ... + Xn)/n

Solution:

• Bias = E[p̂3] − p

  E[p̂3] = E[(X1 + ... + Xn)/n] = (1/n) E[X1 + ... + Xn] = (E[X1] + ... + E[Xn])/n = np/n = p
  Bias = p − p = 0

• Variance(p̂3) = Var((X1 + ... + Xn)/n) = (1/n²) Var(X1 + ... + Xn) = (1/n²)(np(1 − p)) = p(1 − p)/n

• Risk(p̂3) = Bias² + Var(p̂3)
  Therefore,

  Risk = 0 + p(1 − p)/n = p(1 − p)/n

Example 4: Let X1, X2, ..., Xn ∼ iid Bernoulli(p). Consider an estimator p̂ of p given by

p̂ = (X1 + ... + Xn + √n/2)/(n + √n)

Find the bias, variance and risk of p̂.

Solution:

• Bias = E[p̂] − p

  E[p̂] = E[(X1 + ... + Xn + √n/2)/(n + √n)]
       = (1/(n + √n)) E[X1 + ... + Xn + √n/2]
       = (1/(n + √n)) (E[X1] + ... + E[Xn] + √n/2)
       = (np + √n/2)/(n + √n)

  Bias = (np + √n/2)/(n + √n) − p = (np + √n/2 − np − √n p)/(n + √n) = (√n/2 − √n p)/(n + √n)

• Variance(p̂) = Var((X1 + ... + Xn + √n/2)/(n + √n)) = Var((X1 + ... + Xn)/(n + √n))
  Therefore,

  Var(p̂) = (1/(n + √n)²) Var(X1 + ... + Xn) = np(1 − p)/(n + √n)²

• Risk(p̂) = Bias² + Var(p̂)
  Therefore,

  Risk = ((√n/2 − √n p)/(n + √n))² + np(1 − p)/(n + √n)²
       = (1/(n + √n)²) (np − np² + n(1/2 − p)²)
       = (1/(n + √n)²) (np − np² + n/4 + np² − np)
       = n/(4(n + √n)²)
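
As a quick numerical check (not from the text), the following sketch simulates the estimator of Example 4 and compares the empirical bias, variance, and risk with the formulas derived above; n = 100, p = 0.3, and the number of repetitions are arbitrary choices.

```python
# Hedged sketch: simulate p_hat = (X1 + ... + Xn + sqrt(n)/2) / (n + sqrt(n))
# and compare empirical bias, variance and risk with the formulas above.
# n, p and the repetition count are arbitrary choices for illustration.
import numpy as np

rng = np.random.default_rng(3)
n, p, reps = 100, 0.3, 500_000
root_n = np.sqrt(n)

s = rng.binomial(n, p, size=reps)            # X1 + ... + Xn for each repetition
p_hat = (s + root_n / 2) / (n + root_n)

print("bias     sim:", p_hat.mean() - p, " formula:", root_n * (0.5 - p) / (n + root_n))
print("variance sim:", p_hat.var(), " formula:", n * p * (1 - p) / (n + root_n) ** 2)
print("risk     sim:", np.mean((p_hat - p) ** 2), " formula:", n / (4 * (n + root_n) ** 2))
```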

5 Estimator design approach: Method of moments


In this section, we will look at one of the most popular and simple methods of designing estimators, called the method of moments.

5.1 Moments and parameters
Suppose we have a random variable X with some distribution fX(x), either a PDF or a PMF, with some unknowns which we call parameters θ1, θ2, .... Given the PMF or PDF, we can always compute the moments or expected values, which can be expressed as functions of the parameters.

For example,

1. X ∼ Bernoulli(p)
E[X] = p

2. X ∼ Poisson(λ)
E[X] = λ

3. X ∼ Exponential(λ)
E[X] = 1/λ

4. X ∼ Normal(µ, σ 2 )
E[X] = µ and E[X 2 ] = σ 2 + µ2

5. X ∼ Gamma(α, β)
   E[X] = α/β and E[X²] = α²/β² + α/β²

6. X ∼ Binomial(n, p)
   E[X] = np and E[X²] = np(1 − p) + n²p²

You can see in all the above examples that we can write the moments as a function of
parameters.

5.2 Moments of samples


Let X1, X2, ..., Xn ∼ iid X. The k-th sample moment is defined as

Mk(X1, ..., Xn) = (1/n) Σ_{i=1}^{n} Xi^k

Given a sampling instance x1 , . . . , xn ,


• The first sample moment is m1 = (1/n)(x1 + x2 + ... + xn)

• The second sample moment is m2 = (1/n)(x1² + x2² + ... + xn²)

Therefore, the k-th sample moment is the average of the k-th powers of the samples we observe. Note that a sample moment is a random variable and has a distribution; it is not the same as the distribution moment, which is a fixed constant, a function of the parameters. We expect Mk to take values around E[X^k]. Using the WLLN and CLT, we can argue that for larger and larger samples the sample moments take values close to the corresponding distribution moments.

5.3 Method of moments


The method of moments exploits this concentration: for large samples, the sample moments are approximately equal to the distribution moments.

• Procedure
  – Equate sample moments to expressions for the moments in terms of the unknown parameters.
  – Solve for the unknown parameters.
• One parameter θ usually needs one moment. This is because we expect the first moment to be a function of θ. Sometimes it is a constant: for example, for Normal(0, σ²) the first moment is 0, which is not a function of the parameter. In that case we keep looking at higher moments until we find one that involves the parameter; here, that would be the second moment.
  – Sample moment: m1
  – Distribution moment: E[X] = f(θ)
  – Solve for θ from f(θ) = m1 in terms of m1
  – θ̂: replace m1 by M1 in the above solution

• Two parameters θ1, θ2 usually need two moments
  – Sample moments: m1, m2
  – Distribution moments: E[X] = f(θ1, θ2), E[X²] = g(θ1, θ2)
  – Solve for θ1, θ2 from f(θ1, θ2) = m1, g(θ1, θ2) = m2 in terms of m1, m2
  – θ̂1, θ̂2: replace m1 by M1 and m2 by M2 in the above solution

Example 5: Let X1, ..., Xn ∼ iid Bernoulli(p). Find the method of moments estimate of p.

Solution:

• First sample moment, m1 = E[X] = p


• Method of moments equation
p = m1

• Estimator: p̂ = M1 = (X1 + ... + Xn)/n
Example 6: Let X1 , . . . , Xn ∼ iid Poisson(λ). Find the method of moments estimate of λ.

Solution:

• First sample moment, m1 = E[X] = λ

• Method of moments equation


λ = m1
• Estimator: λ̂ = M1 = (X1 + ... + Xn)/n
Example 7: Let X1 , . . . , Xn ∼ iid Normal(µ, σ 2 ). Find the method of moments estimate of
µ and σ 2 .

Solution:

• E[X] = µ, E[X 2 ] = µ2 + σ 2

• Method of moments equation


µ = m1
µ2 + σ 2 = m2
• Estimator for µ: µ̂ = M1 = (X1 + ... + Xn)/n

  Estimator for σ:

  σ̂ = √(M2 − µ̂²) = √((X1² + ... + Xn²)/n − (X1 + ... + Xn)²/n²)

5.4 Method of moments estimation


Below are a few examples where we use the method of moments to estimate the parameters.

1. Consider a sample 1, 0, 0, 1, 0, 1, 1, 1, 0, 0 from a Bernoulli(p) distribution.

(1 + 0 + 0 + 1 + 0 + 1 + 1 + 1 + 0 + 0) 5
p̂ = = = 0.5
10 10

2. Alpha particle emission in 10 seconds: Poisson(λ)
   Number of particles emitted per second = 0.8392

   λ̂ = average number of particles emitted in 10 seconds = 8.392

   λ is just the average number of particles emitted in 10 seconds. So, we can count the total number of particles emitted, divide by the total time to find the average number of particles per second, and multiply by 10. That also gives 8.392.

3. Normal(µ, σ²): 1.07, 0.91, 0.88, 1.07, 1.15, 1.02, 0.99, 0.99, 1.08, 1.08

   µ̂ = m1 = (1.07 + 0.91 + ... + 1.08)/10 = 1.024
   σ̂ = √(m2 − m1²) = √(1.05482 − 1.024²) = 0.079
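
A short sketch (not from the text) that reproduces the method of moments estimates for the ten measurements in item 3, assuming numpy is available:

```python
# Sketch: method of moments estimates for the ten measurements above.
import numpy as np

x = np.array([1.07, 0.91, 0.88, 1.07, 1.15, 1.02, 0.99, 0.99, 1.08, 1.08])
m1 = x.mean()              # first sample moment
m2 = np.mean(x ** 2)       # second sample moment

mu_hat = m1
sigma_hat = np.sqrt(m2 - m1 ** 2)
print(mu_hat, sigma_hat)   # approximately 1.024 and 0.079
```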

It is not unreasonable for a method like the method of moments to work, and it is quite useful. Quite often we may not get a handle on the actual distribution itself; we may not know the distribution and have to guess at it. Instead of guessing, we can just use the moments, which can be estimated quite reliably from the samples using the sample moments.

6 Estimator design approach: Maximum Likelihood


In the previous section, we saw method of moments for designing estimators. In this section,
we are going to look at a very important and interesting principle called maximum likelihood.

6.1 Likelihood of iid samples


Suppose we are observing n iid samples from a distribution fX (x) with unknown parameters,
θ1 , θ2 , . . . . We will write the distribution as fX (x; θ1 , θ2 , . . .), which also shows that the value
of distribution at point x depends on the parameters θ1 , θ2 , . . .. For example, suppose we
have a normal distribution with parameters µ and σ 2 . Its PDF is given by

fX(x) = (1/(σ√(2π))) exp(−(x − µ)²/(2σ²))
We can also write fX(x) as fX(x; µ, σ) to bring out the fact that µ and σ play a role in the PDF of a normal distribution.
Samples are usually random variables, so for a particular sampling instance, we know
that those random variables take some actual values. We can denote the actual values
as x1 , x2 , . . . , xn . We denote the likelihood of that actual sampling as L(x1 , x2 , . . . , xn ).

Sometimes the arguments will be dropped and we will just say the likelihood function is L. It is defined as

L(x1, x2, ..., xn) = ∏_{i=1}^{n} fX(xi; θ1, θ2, ...)

The likelihood represents the probability of occurrence of a particular sample. Since the
samples are independent, we can multiply their likelihoods. Likelihood is a function of the
unknown parameters of the distribution.

Example 8: Consider a sampling instance from a Bernoulli (p): 1, 0, 0, 1, 0, 1, 1, 1, 0, 0.


Find the likelihood of the samples.

Solution:

Likelihood, L = p(1 − p)(1 − p)p(1 − p)ppp(1 − p)(1 − p) = p⁵(1 − p)⁵

Example 9: Consider a sampling instance from Normal(µ, σ 2 ): 1.07, 0.91, 0.88, 1.07,
1.15, 1.02, 0.99, 0.99, 1.08, 1.08. Find the likelihood of the samples.

Solution:

Likelihood,

L = (1/(σ√(2π))) exp(−(1.07 − µ)²/(2σ²)) × ... × (1/(σ√(2π))) exp(−(1.08 − µ)²/(2σ²))
  = (1/(σ√(2π)))^10 exp(−[(1.07 − µ)² + ... + (1.08 − µ)²]/(2σ²))

6.2 Maximum likelihood (ML) estimator


The maximum likelihood estimator maximizes the likelihood function. For a sampling x1, x2, ..., xn, the maximum likelihood estimate is defined as

θ1*, θ2*, ... = arg max_{θ1, θ2, ...} ∏_{i=1}^{n} fX(xi; θ1, θ2, ...)

The question is: why maximize the likelihood? It is intuitive: the samples come from a distribution with unknown parameters, and we are calculating their probability. Out of all possible parameter values, we pick the one that gives the maximum likelihood for that particular sample. This seems like a very natural way to define an estimator.
Therefore, we find the likelihood of a sampling instance and then find the parameter that maximizes the likelihood.

Example 10: X1 , . . . , Xn ∼ iid Bernoulli(p). Find the ML estimate of p.

Solution:

• Samples: x1, x2, ..., xn, where each xi is either 0 or 1.

  P(Xi = 1) = p and P(Xi = 0) = 1 − p, for i = 1, 2, ..., n

• Likelihood function, L = p^w (1 − p)^(n−w), where w = x1 + ... + xn is the number of 1s in the sample.

• ML estimation:
  Log likelihood, log L = w log p + (n − w) log(1 − p)

  p* = arg max_p [w log p + (n − w) log(1 − p)]

  Differentiate log L w.r.t. p and equate it to 0:

  w/p − (n − w)/(1 − p) = 0
  ⟹ w(1 − p) = p(n − w)
  ⟹ w = np
  ⟹ p = w/n

  Therefore, p̂_ML = w/n
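
As a sanity check (not from the text), the following sketch evaluates the log-likelihood from Example 10 on a grid of p values for the sample of Example 8 (w = 5 ones out of n = 10) and confirms that the maximizer is close to w/n.

```python
# Sketch: evaluate the Bernoulli log-likelihood on a grid of p values for
# w = 5 ones out of n = 10 and confirm the maximizer is close to w/n.
import numpy as np

w, n = 5, 10
p_grid = np.linspace(0.01, 0.99, 981)
log_L = w * np.log(p_grid) + (n - w) * np.log(1 - p_grid)

print("grid maximizer:", p_grid[np.argmax(log_L)])   # close to 0.5
print("closed form w/n:", w / n)
```
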
Example 11: X1 , . . . , Xn ∼ iid Poisson(λ). Find the ML estimate of λ.

Solution:

• Samples: x1, x2, ..., xn, where each xi is a non-negative integer.

  P(X = k) = λ^k e^(−λ)/k!

• Likelihood function, L = ∏_{i=1}^{n} λ^(xi) e^(−λ)/xi! = λ^(x1 + ... + xn) e^(−nλ)/(x1! x2! ... xn!)

• ML estimation:
  Log likelihood,

  log L = (x1 + ... + xn) log λ − nλ − log x1! − ... − log xn!

  λ* = arg max_λ [(x1 + ... + xn) log λ − nλ − log x1! − ... − log xn!]

  Differentiate log L w.r.t. λ and equate it to 0:

  (x1 + ... + xn)/λ − n = 0
  ⟹ λ = (x1 + ... + xn)/n = x̄

  Therefore, λ̂_ML = X̄

Example 12: X1 , . . . , Xn ∼ iid Normal(µ, σ 2 ).

1. Find the maximum likelihood estimate of µ.

2. Find the maximum likelihood estimate of σ 2 .

Solution:

• Samples: x1 , x2 , . . . , xn .

  fX(x) = (1/(σ√(2π))) exp(−(x − µ)²/(2σ²))

• Likelihood function,

  L = ∏_{i=1}^{n} (1/(σ√(2π))) exp(−(xi − µ)²/(2σ²)) = (1/(σ√(2π)))^n exp(−[(x1 − µ)² + ... + (xn − µ)²]/(2σ²))

• ML estimation:
  Log likelihood,

  log L = n log(1/(σ√(2π))) − (1/(2σ²))[(x1 − µ)² + ... + (xn − µ)²]
        = −n log(σ√(2π)) − (1/(2σ²))[(x1 − µ)² + ... + (xn − µ)²]

1. ML estimate of µ

   µ* = arg max_µ [−n log(σ√(2π)) − (1/(2σ²))[(x1 − µ)² + ... + (xn − µ)²]]

   Differentiate log L w.r.t. µ keeping σ constant and equate it to 0.

   d/dµ [−n log(σ√(2π)) − (1/(2σ²)) Σ_{i=1}^{n} (xi − µ)²] = 0
   ⟹ 0 + (1/(2σ²)) Σ_{i=1}^{n} 2(xi − µ) = 0
   ⟹ (1/σ²) Σ_{i=1}^{n} (xi − µ) = 0
   ⟹ Σ_{i=1}^{n} xi − nµ = 0
   ⟹ µ = (x1 + ... + xn)/n = x̄

   Therefore, µ̂_ML = X̄

2. ML estimate of σ²

   σ²* = arg max_σ [−n log(σ√(2π)) − (1/(2σ²))[(x1 − µ)² + ... + (xn − µ)²]]

   Differentiate log L w.r.t. σ keeping µ constant and equate it to 0.

   d/dσ [−n log(σ√(2π)) − (1/(2σ²)) Σ_{i=1}^{n} (xi − µ)²] = 0
   ⟹ −n/σ + (1/σ³) Σ_{i=1}^{n} (xi − µ)² = 0
   ⟹ Σ_{i=1}^{n} (xi − µ)² = nσ²
   ⟹ σ² = (1/n) Σ_{i=1}^{n} (xi − µ)²

   Therefore, σ̂²_ML = (1/n) Σ_{i=1}^{n} (Xi − µ̂_ML)²
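
As a hedged numerical check (not from the text), the sketch below maximizes the Normal log-likelihood directly with scipy and compares the result with the closed-form ML estimates derived above; the ten measurements from earlier are reused, and Nelder-Mead with the starting point (0, 0) is an arbitrary choice.

```python
# Hedged sketch: maximize the Normal log-likelihood numerically and compare
# with the closed-form ML estimates derived above.
import numpy as np
from scipy.optimize import minimize

x = np.array([1.07, 0.91, 0.88, 1.07, 1.15, 1.02, 0.99, 0.99, 1.08, 1.08])
n = len(x)

def neg_log_likelihood(params):
    mu, log_sigma = params                 # optimize log(sigma) so that sigma > 0
    sigma = np.exp(log_sigma)
    return n * np.log(sigma * np.sqrt(2 * np.pi)) + np.sum((x - mu) ** 2) / (2 * sigma ** 2)

res = minimize(neg_log_likelihood, x0=[0.0, 0.0], method="Nelder-Mead")
mu_num, sigma2_num = res.x[0], np.exp(2 * res.x[1])

print("numerical  :", mu_num, sigma2_num)
print("closed form:", x.mean(), np.var(x))  # np.var divides by n, matching the ML estimate
```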

6.3 Observations
• Maximum likelihood is a very popular method for deriving estimators.

• Theoretically and intuitively appealing: maximize the probability or likelihood of the
observed samples.

• Deriving the actual estimator needs some careful calculus.

• Numerous questions

– How do ML estimators look? They seem similar to MME, so far.


– How does MME compare with ML? How to compare estimators?

7 Finding MME and ML estimators


Example 13: X1, X2, ..., Xn ∼ iid Exp(λ). Find the method of moments estimate and maximum likelihood estimate of λ.

Solution:

• Using Method of moments

  Distribution moment: µ = E[X] = 1/λ
  Sample moment: X̄ = (X1 + X2 + ... + Xn)/n

  Therefore, λ̂_MME = n/(X1 + ... + Xn)
• Using Maximum likelihood

  – Likelihood function, L = ∏_{i=1}^{n} λ e^(−λxi) = λ^n e^(−λ(x1 + ... + xn))
  – Log likelihood, log L = n log λ − λ(x1 + ... + xn)

    λ* = arg max_λ [n log λ − λ(x1 + ... + xn)]

    Differentiate log L w.r.t. λ and equate it to 0:

    n/λ − (x1 + ... + xn) = 0
    λ = n/(x1 + ... + xn)

    Therefore, λ̂_ML = n/(X1 + ... + Xn)

Example 14: Let X1, X2, ..., Xn ∼ iid X, where the distribution of X is

x        1    2    3
fX(x)    p1   p2   p3

where p1 + p2 + p3 = 1 and 0 < pi < 1.

Find the method of moments estimate and maximum likelihood estimate of p1, p2 and p3.

Solution:

• Using Method of moments

  First sample moment: m1 = X̄
  Second sample moment: m2 = (1/n) Σ_{i=1}^{n} Xi²

  E[X] = p1 + 2p2 + 3p3
  E[X²] = p1 + 4p2 + 9p3

  m1 = p1 + 2p2 + 3p3
     = p1 + 2p2 + 3(1 − p1 − p2)   [Using p1 + p2 + p3 = 1]
     = 3 − 2p1 − p2                 (1)

  m2 = p1 + 4p2 + 9p3
     = p1 + 4p2 + 9(1 − p1 − p2)   [Using p1 + p2 + p3 = 1]
     = 9 − 8p1 − 5p2                (2)

  Solving (1) and (2), we get

  p1 = (1/2)m2 − (5/2)m1 + 3
  p2 = 4m1 − m2 − 3

  Therefore, the MME estimators of p1 and p2 are

  p̂1_MME = (1/2)M2 − (5/2)M1 + 3
  p̂2_MME = 4M1 − M2 − 3

  and p̂3_MME = 1 − p̂1_MME − p̂2_MME.

• Using Maximum likelihood

  – Likelihood function, L = p1^(w1) p2^(w2) (1 − p1 − p2)^(n − w1 − w2),
    where w1 is the number of times 1 appears in the sample and w2 is the number of times 2 appears in the sample.
  – Log likelihood, log L = w1 log p1 + w2 log p2 + (n − w1 − w2) log(1 − p1 − p2)

    p1*, p2* = arg max_{p1, p2} [w1 log p1 + w2 log p2 + (n − w1 − w2) log(1 − p1 − p2)]

    Differentiate log L w.r.t. p1 keeping p2 constant and equate it to 0:

    (w1 + w2 − n)/(1 − p1 − p2) + w1/p1 = 0    (3)

    Differentiate log L w.r.t. p2 keeping p1 constant and equate it to 0:

    (w1 + w2 − n)/(1 − p1 − p2) + w2/p2 = 0    (4)

    Solving (3) and (4), we get p1 = w1/n, p2 = w2/n.

    Therefore, p̂1_ML = w1/n, p̂2_ML = w2/n, p̂3_ML = 1 − p̂1_ML − p̂2_ML
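
A small sketch (not from the text) illustrating that the ML estimates for Example 14 are just the sample frequencies; the true probabilities and the sample size are arbitrary choices for illustration.

```python
# Hedged sketch: the ML estimates for Example 14 are the sample frequencies.
# The true probabilities and sample size are arbitrary choices.
import numpy as np

rng = np.random.default_rng(4)
p_true = [0.2, 0.5, 0.3]
x = rng.choice([1, 2, 3], size=1000, p=p_true)

n = len(x)
p1_ml = np.sum(x == 1) / n     # w1 / n
p2_ml = np.sum(x == 2) / n     # w2 / n
p3_ml = 1 - p1_ml - p2_ml
print(p1_ml, p2_ml, p3_ml)     # should be close to p_true
```
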
Example 15: X1 , . . . , Xn ∼ Uniform [0, θ]. Find the method of moments estimate and
maximum likelihood estimate of θ.

Solution:

• Using Method of moments

  First sample moment: m1 = X̄

  E[X] = θ/2

  Now, m1 = θ/2 ⟹ θ = 2m1

  Therefore, the MME estimator of θ is

  θ̂_MME = 2(X1 + ... + Xn)/n

• Using Maximum likelihood


Likelihood function, L = 1/θ^n if 0 < x1, ..., xn < θ, and 0 otherwise

θ* = arg max_θ 1/θ^n

In order to maximize L, we need to pick the smallest feasible value of θ.
Since x1, ..., xn < θ, the smallest feasible θ is max(x1, ..., xn).
Therefore, θ̂_ML = max(X1, ..., Xn).
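
The MME and ML estimators of Example 15 behave quite differently, as the following hedged simulation sketch (not from the text) shows; the true θ, the sample size, and the number of repetitions are arbitrary choices. In this simulation the ML estimator is biased low but typically has a smaller mean squared error.

```python
# Hedged sketch: compare the MME (2 * sample mean) and ML (sample maximum)
# estimators for Uniform[0, theta]. theta, n and the repetition count are
# arbitrary choices for illustration.
import numpy as np

rng = np.random.default_rng(5)
theta_true, n, reps = 5.0, 20, 100_000

x = rng.uniform(0, theta_true, size=(reps, n))
theta_mme = 2 * x.mean(axis=1)   # method of moments: 2 * X_bar
theta_ml = x.max(axis=1)         # maximum likelihood: max(X1, ..., Xn)

print("MME: mean", theta_mme.mean(), " MSE", np.mean((theta_mme - theta_true) ** 2))
print("ML : mean", theta_ml.mean(), " MSE", np.mean((theta_ml - theta_true) ** 2))
```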

Example 16: X1 , . . . , Xn ∼ Uniform {1, 2, . . . , N }. Find the method of moments estimate


and maximum likelihood estimate of N .

Solution:

• Using Method of moments

  First sample moment: m1 = X̄

  E[X] = 1·(1/N) + 2·(1/N) + ... + N·(1/N) = (1/N)(1 + 2 + ... + N) = (N + 1)/2

  Now, m1 = (N + 1)/2 ⟹ N = 2m1 − 1

  Therefore, the MME estimator of N is

  N̂_MME = 2(X1 + ... + Xn)/n − 1

• Using Maximum likelihood


Likelihood function, L = 1/N^n if 1 ≤ x1, ..., xn ≤ N, and 0 otherwise

N* = arg max_N 1/N^n

In order to maximize L, we need to pick the smallest feasible value of N.
Since x1, ..., xn ≤ N, the smallest feasible N is max(x1, ..., xn).
Therefore, N̂_ML = max(X1, ..., Xn).

Example 17: X1, ..., Xn ∼ iid X, where X is Gamma(α, β).

fX(x) = (β^α/Γ(α)) x^(α−1) e^(−βx)

Find the method of moments estimate and maximum likelihood estimate of α and β.

Solution:

• Using Method of moments

  First sample moment: m1 = X̄
  Second sample moment: m2 = (1/n) Σ_{i=1}^{n} Xi²

  E[X] = α/β
  E[X²] = α²/β² + α/β²   [since Var(X) = α/β²]

  Now,

  m1 = α/β ⟹ α = βm1    (5)

  m2 = α/β² + α²/β²
     = βm1/β² + (βm1)²/β²   [Using (5)]
     = m1/β + m1²

  β = m1/(m2 − m1²)    (6)

  Now, α = βm1 = (m1/(m2 − m1²)) m1 = m1²/(m2 − m1²)

  Therefore, the MME estimator of α is

  α̂_MME = M1²/(M2 − M1²)

  and the MME estimator of β is

  β̂_MME = M1/(M2 − M1²)
• Using Maximum likelihood

  – Likelihood function,

    L = ∏_{i=1}^{n} (β^α/Γ(α)) xi^(α−1) e^(−βxi) = (β^α/Γ(α))^n (x1 x2 ... xn)^(α−1) e^(−β(x1 + ... + xn))

  – Log likelihood,

    log L = nα log(β) − n log(Γ(α)) + (α − 1) log(x1 x2 ... xn) − β(x1 + ... + xn)

    α*, β* = arg max_{α, β} [nα log(β) − n log(Γ(α)) + (α − 1) log(x1 x2 ... xn) − β(x1 + ... + xn)]

    Differentiate log L w.r.t. α keeping β constant and equate it to 0:

    n log(β) − n Γ′(α)/Γ(α) + log(x1 ... xn) = 0
    ⟹ Γ′(α)/Γ(α) − log(β) = (1/n) Σ_{i=1}^{n} log(xi)

    Differentiate log L w.r.t. β keeping α constant and equate it to 0:

    nα/β − (x1 + ... + xn) = 0
    ⟹ α = β (x1 + ... + xn)/n
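
The two ML equations above have no closed-form solution, but they can be solved numerically. As a hedged sketch (not from the text), substituting β = α/x̄ from the second equation into the first gives ψ(α) − log α = mean(log xi) − log x̄, where ψ = Γ′/Γ is the digamma function; this one-dimensional equation can be solved with scipy. The true parameter values and sample size below are arbitrary choices, and the MME estimates are printed for comparison.

```python
# Hedged sketch: solve the Gamma ML equations numerically. Substituting
# beta = alpha / mean(x) into the first equation gives
#   digamma(alpha) - log(alpha) = mean(log x) - log(mean(x)),
# which is solved by root finding. True parameters are arbitrary choices.
import numpy as np
from scipy.special import digamma
from scipy.optimize import brentq

rng = np.random.default_rng(6)
alpha_true, beta_true = 3.0, 2.0
x = rng.gamma(shape=alpha_true, scale=1 / beta_true, size=5000)  # numpy's scale is 1/beta

s = np.mean(np.log(x)) - np.log(np.mean(x))   # <= 0 by Jensen's inequality
alpha_ml = brentq(lambda a: digamma(a) - np.log(a) - s, 1e-6, 1e6)
beta_ml = alpha_ml / np.mean(x)
print("ML :", alpha_ml, beta_ml)

m1, m2 = np.mean(x), np.mean(x ** 2)          # method of moments for comparison
print("MME:", m1 ** 2 / (m2 - m1 ** 2), m1 / (m2 - m1 ** 2))
```
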
Example 18: X1 , X2 , . . . , Xn ∼ Binomial(N, p). Find the method of moments estimate and
the maximum likelihood estimate of N and p.

Solution:

• Using Method of moments

  First sample moment: m1 = X̄
  Second sample moment: m2 = (1/n) Σ_{i=1}^{n} Xi²

  E[X] = Np
  E[X²] = Np(1 − p) + N²p²   [since Var(X) = Np(1 − p)]

  Now,

  m1 = Np    (7)

  m2 = Np(1 − p) + N²p²
     = m1(1 − p) + m1²   [Using (7)]

  p = (m1² + (m1 − m2))/m1    (8)

  Now, N = m1/p = m1²/(m1² + (m1 − m2))

  Therefore, the MME estimator of N is

  N̂_MME = M1²/(M1² + (M1 − M2))

  and the MME estimator of p is

  p̂_MME = (M1² + (M1 − M2))/M1

• Using Maximum likelihood

– Likelihood function,

  L = ∏_{i=1}^{n} (N choose xi) p^(xi) (1 − p)^(N − xi)
    = (N choose x1)(N choose x2) ... (N choose xn) p^(x1 + ... + xn) (1 − p)^(nN − (x1 + ... + xn))

– Log likelihood,

  log L = log(N choose x1) + log(N choose x2) + ... + log(N choose xn) + (x1 + ... + xn) log p + (nN − (x1 + ... + xn)) log(1 − p)

  N*, p* = arg max_{N, p} log L

  Differentiate log L w.r.t. p keeping N constant and equate it to 0:

  (x1 + ... + xn)/p − (nN − (x1 + ... + xn))/(1 − p) = 0
  ⟹ x1 + ... + xn = npN
  ⟹ p = (x1 + ... + xn)/(nN)

  If we differentiate log L w.r.t. N keeping p constant, we get a very complicated expression.

  Therefore, p̂_ML = (X1 + ... + Xn)/(nN)
