
RMET101, Fall 2023

Methods and Statistics I


Lectures by Dr. Joseph Resovsky
University College Roosevelt
Week 6, Meeting 12
Asking For Information      Gathering Information                              Assessing Information
Why to ask (philosophy)     The correct measurement                            How does it look?
                            (levels, reliability, construct validity)          (descriptives)
What to ask (vocabulary)    The correct process                                Looking deeper
                            (internal validity, design)                        (bivariates, normalcy, reliability)
How to ask (rules)          The correct subjects                               What answers does it give?
                            (external validity, sampling)                      (hyp. testing)

• “What does the sample say about the population?” (P&D, chapters 8, 9) (meeting 12)
  – Specific questions:
    • What do the center and variability statistics of the sample distribution tell us
      about the center and variability of the population?
    • What do x̄ (x with a bar over it) and s say about μ and σ?
    • What does the sample proportion p̂ (small p with a hat) say about the population
      proportion p (small p with no hat)? [Please note: proportion = (# in category) / (# in
      sample) or (# in category) / (# in population)]
      (The original, much better and more logical version of the textbook asked: what does sample proportion p say
      about population proportion p? Sorry… blame the textbook authors, not me!)
  – Key concepts: sampling variability and sampling distributions
    • Different samples from the same population have different statistics (e.g. different
      means)
    • Just as the variable x changes from element to element within the sample (or
      population!) to make a distribution, the statistic x̄ changes from sample to sample
      to make a distribution of means.
    • Sample-to-sample variability depends on (a) element-to-element variability in the
      population and (b) the size of the sample
  – Our secret weapon: the Central Limit Theorem
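The two dependencies above can be seen in a quick simulation. This is a minimal sketch with a hypothetical population of exam-like scores (the numbers 70 and 10 are illustrative, not from the slides): repeated samples of the same size each produce their own mean, and the spread of those means shrinks as the sample size grows.

```python
import random
import statistics

random.seed(1)

# Hypothetical population: scores with mean ~70 and s.d. ~10
population = [random.gauss(70, 10) for _ in range(100_000)]

def sample_means(n, reps=2000):
    """Draw many samples of size n; return the mean of each sample."""
    return [statistics.mean(random.sample(population, n)) for _ in range(reps)]

means_small = sample_means(9)
means_large = sample_means(81)

# Sample-to-sample variability shrinks with sample size:
print(statistics.stdev(means_small))  # near 10/sqrt(9)  = 3.33
print(statistics.stdev(means_large))  # near 10/sqrt(81) = 1.11
```

Each run gives a whole *distribution of means*, which is exactly the sampling distribution discussed in the slides that follow.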

• Sample stats vs. sampLING stats

  o x̄ = sample mean of the variable x
  o μx = population mean of the variable x
  o μx̄ = “mean of all possible sample means of the variable x
    for the same sample size” = “population mean of x̄”
  o sx = sample standard deviation of the variable x [with 1/(n−1)]
  o σx = population standard deviation of the variable x [with 1/N]
  o σx̄ = “standard deviation of all possible sample means of the variable x for the
    same sample size” = “standard error of the (sample) mean x̄”
  o p̂A = sample proportion of elements in category A
  o pA = population proportion of elements in category A
  o μp̂ = “mean of all possible sample proportions for the same sample size”
  o σp̂ = “standard deviation of all possible sample proportions for the same sample size”
    = “standard error of the (sample) proportion p̂”

• Central Limit Theorem part 1

  – Main idea: distributions of sample statistics are usually better
    behaved than distributions of variables in a sample. (“Poll of Polls”)
  – Supporting theories [for sample size N; using the “approximately equal” symbol ≈]
    • Rule 1: μx̄ = μx ≈ x̄ (x̄ is the point estimate of the population mean)
    • Rule 2: σx̄ = σx / sqrt(N) ≈ sx / sqrt(N) (sx is the point estimate of the population s.d. σx)
    • Rule 3: the distribution of means is normal when the population distribution
      of the variable is normal OR the sample size is LARGE (even if the population
      distribution is NOT normal)
  – Supporting theories for proportions [for sample size N; using the
    “approximately equal” symbol ≈]
    • Rule 1: μp̂ = pA ≈ p̂A (p̂A is the point estimate of the population proportion in cat. A)
    • Rule 2: σp̂ = sqrt(pA·(1−pA) / N) ≈ sqrt(p̂·(1−p̂) / N)
    • Rule 3: the distribution of proportions is normal if the sample size is large
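Rules 1 and 2 for proportions can be checked numerically. This sketch uses a made-up yes/no population with true proportion p = 0.30 (an assumption for illustration, not a number from the slides):

```python
import math
import random
import statistics

random.seed(2)

# Hypothetical population: 1 = "yes", 0 = "no"; true proportion p = 0.30
population = [1] * 3000 + [0] * 7000
p = statistics.mean(population)

N = 50
phats = [statistics.mean(random.sample(population, N)) for _ in range(5000)]

# Rule 1: the mean of the sample proportions is close to p
print(statistics.mean(phats))                         # near 0.30

# Rule 2: their s.d. is close to sqrt(p*(1-p)/N)
print(statistics.stdev(phats), math.sqrt(p * (1 - p) / N))
```

Both printed standard deviations land near sqrt(0.3·0.7/50) ≈ 0.065, as Rule 2 predicts.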

• Central Limit Theorem part 2

  – The Cool Rule: the CLT (Central Limit Theorem)
    • Rule 4: for big samples, the distribution of sample means will be normal
      even when the distribution of the variable in the population (or samples)
      is not normal
    • Implication: standard-error z-scores can use Empirical Rule probability
      tables even when standard-deviation z-scores cannot!
  – Use: estimating how likely the true mean is to be a particular distance away
    from the sample mean
  – The remaining problem: we still need μ and σ to know the sampling distribution
    • → use point estimates
    • For means, this requires using t distributions (a normal curve adjusted for
      sample size using degrees of freedom, a measure of the uncertainty
      introduced by point estimates) [for one sample: df = n − 1]
    • For one-sided RQs and two-category proportions, still use the z-score
    • For two-sided RQs and/or 3 or more categories, use chi-squared distributions
      with another type of degrees-of-freedom sample-size adjustment

• Central Limit Theorem part 3: large-sample-size rules

  – Sample size rule of thumb for MEANS: n > 30 is good enough for the central
    limit theorem's assumption of sampling-distribution normalcy/normality.
  – The OTHER possible justification for assuming normalcy/normality:
    • For small samples, sampling normalcy can be assumed if and only if there is
      reason to believe that the population distribution is normal for that variable
    • The best reason is a priori knowledge of the nature of the population distribution
    • The next best justification for a small-sample normalcy assumption is a demonstration
      that the sample is reasonably normal (boxplots) [a weak excuse, but often used!]
  – Sample size rule of thumb for PROPORTIONS
    • When just two categories are used, BOTH n·p > 10 AND n·(1−p) > 10
      – This means that the results of the actual sample should have at least 10 of each of the
        two options
      – In practice, it is easiest to use the smallest null-hypothesis proportion pH alone to guess the
        sample size needed to make sure pH·N > 10
      – BETTER to use null-hypothesis proportions and standard errors to guess the sample
        size needed to be 95% certain that (pH − 2·σp̂)·N > 10
    • If the RQ is TWO-sided …, then somewhat smaller samples are allowed: just need the
      smallest pH·N > 5
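The "easiest" version of the rule of thumb (pH·N > 10) is simple enough to compute directly. This is a sketch; the function name is ours, not the textbook's:

```python
def min_n_for_proportion_rule(p_h, threshold=10):
    """Smallest N with p_h * N strictly greater than the threshold
    (the pH * N > 10 rule of thumb above)."""
    n = 1
    while n * p_h <= threshold:
        n += 1
    return n

print(min_n_for_proportion_rule(0.25))    # 41  (0.25 * 41 = 10.25 > 10)
print(min_n_for_proportion_rule(0.125))   # 81  (0.125 * 81 = 10.125 > 10)
```

The rarer the smaller category, the larger the sample needed just to expect 10 of it.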
• POINT ESTIMATES and CONFIDENCE INTERVALS
  – A C.I. is a range of sufficiently likely values for a population characteristic
    based on a single sample [example: 45 ± 3 or, equivalently, (42, 48)]
    CI = P.E. ± [(critical value)·(standard error)]
  – More about PEs
    • Most important examples of PEs
      – x̄, trimmed mean, median, or mode for μ
      – p̂ for p (p hat for p)
      – s for σ [this is unbiased only because of the (n−1) in the s formula!]
    • Some PEs from a single sample will always be too low/too high = biased
      – Example: s would be a biased (low) estimate of σ if 1/n were in both formulas.
      – Example: the range of a sample is never larger than the range of the population
    • Some PEs will have more sample-to-sample variability than others for
      some distributions
      – Example: the trimmed mean has less variability than the mean for a heavy-tailed
        distribution (due to outliers: some samples have an excess left tail, some an excess
        right tail)
    • Choose the unbiased estimate with the smallest variance
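The bias claim about the 1/n formula can be demonstrated by simulation. A minimal sketch with a hypothetical population of known variance (the numbers are illustrative): the 1/n version of the sample variance systematically underestimates, while the 1/(n−1) version does not.

```python
import random
import statistics

random.seed(3)

# Hypothetical population with variance near 4
population = [random.gauss(0, 2) for _ in range(50_000)]
true_var = statistics.pvariance(population)

n, reps = 5, 20_000
biased, unbiased = [], []
for _ in range(reps):
    sample = random.sample(population, n)
    m = statistics.mean(sample)
    ss = sum((x - m) ** 2 for x in sample)
    biased.append(ss / n)          # 1/n version: biased low
    unbiased.append(ss / (n - 1))  # 1/(n-1) version: unbiased

# The 1/n average lands near (n-1)/n * true_var; the 1/(n-1) average near true_var
print(statistics.mean(biased), statistics.mean(unbiased), true_var)
```

For n = 5 the 1/n version averages about 4/5 of the true variance, which is why the (n−1) divisor matters most for small samples.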
• POINT ESTIMATES and CONFIDENCE INTERVALS
  – More about CI range calculations (standard error · critical value)
    • The critical value is determined (in combination with textbook appendix Table 3) by
      – The confidence level (CL) desired for the conclusion
        » There is a CL% chance that the population characteristic is within the CI
        » Examples: 80%, 90%, 95%, 99%, 99.9%
        » A smaller CL yields a smaller (narrower) CI
      – The type of statistic concerned (a mean, a proportion, or the difference between the
        means or proportions of two samples)
        » Proportions → z-critical
        » Means → t-critical (when the population stdev is unknown… wider than the z distribution
          due to the extra uncertainty of using the sample to estimate the stdev as well as the mean!)
        » Means → z-critical (when the population stdev is known)
      – For t-critical, the value also depends on degrees of freedom (df)
        » df accounts for the extra uncertainty of using the sample s to estimate the standard error!
        » df = (sample size − 1) for a single sample; a messier formula for two samples
        » Larger df (e.g. by using z instead of t) yields a smaller (narrower) C.I.
    • The standard error of the statistic used for the point estimate
      – SE is an estimate of the standard deviation of the PE's sampling distribution
      – SE is not the standard deviation of the sample or the population
      – For a mean, SE = sqrt(s²/n), which is an estimate of sqrt(σ²/n)
      – For a proportion, SE = sqrt(p̂(1−p̂)/n) (formula using the SAMPLE proportion), which is an
        estimate of sqrt(p(1−p)/n) (formula using the true POPULATION proportion)
      – In both cases a larger sample size yields a smaller (narrower) C.I. due to the sqrt(1/n)
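Putting the pieces together for a mean: a minimal sketch with hypothetical summary numbers (chosen to echo the 45 ± 3 example above). The t-critical value is the table value for 95% CL and df = 24, typed in by hand just as one would read it from the textbook appendix:

```python
import math

# Hypothetical sample summary (illustrative numbers)
n, xbar, s = 25, 45.0, 7.5

se = s / math.sqrt(n)    # standard error of the mean = 1.5
t_crit = 2.064           # table t-critical value, 95% CL, df = n - 1 = 24

margin = t_crit * se     # critical value * standard error
ci = (xbar - margin, xbar + margin)
print(ci)                # roughly (41.9, 48.1)
```

Note the three levers from the slide: a higher CL raises t_crit (wider CI), a larger n lowers se (narrower CI), and larger df pulls t_crit down toward the z value of 1.96.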

• POINT ESTIMATES and CONFIDENCE INTERVALS… final notes

  – t-distributions are wider for lower degrees of freedom (df)
    • df = n − 1 (the 1 is for the estimate of the standard deviation)
    • Higher degrees of freedom → closer to the z-curve
      – Above df = 200 they are practically the same
  – Proportion intervals are wider when p is closer to 50%
    • Big samples are needed to guess the winners of close races!
  – Can use the desired 95% CI and a known or guessed standard error to find the
    sample size needed
    • Define the desired bound B = 1.96·sqrt(σ²/n)
    • Solve for n = (1.96·σ/B)²
  – Section 9.4 is worth reading!
  → mean ± sx̄·[table t-critical value for df and chosen CL]
  → p̂ ± sp̂·[table z-critical value for chosen CL]
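The solve-for-n step is a one-liner. A sketch using the cereal-box σ = 0.05 from the example later in the deck and an assumed desired bound B = 0.02 ounces (the function name and B value are ours):

```python
import math

def n_for_mean_bound(sigma, B, z=1.96):
    """Sample size so that z * sigma / sqrt(n) <= B, i.e. n = (z*sigma/B)^2,
    rounded up to a whole number of subjects."""
    return math.ceil((z * sigma / B) ** 2)

# Guessed sigma = 0.05 ounces, desired 95% bound B = 0.02 ounces
print(n_for_mean_bound(0.05, 0.02))   # 25
```

Halving B quadruples n, because n grows with (1/B)².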
Probability Calculations for Sampling Distributions
• Question types: “Given the population mean/proportion (and, if relevant, the stdev) of variable x / in
  category A, what is the chance of a new sample of a certain size having a sample mean/proportion
  above/below a given value x_ref or p_ref?”
• Mathematical version of the question: given sample size N and pA, or given sample size N and μx and σx,
  what is P(x̄ > or < x_ref)
  or P(p̂A > or < p_ref)?
• Mathematical answers:
  P(p̂A > or < p_ref) = P(z > or < [(p_ref − pA)/σp̂])
    with σp̂ = sqrt(pA·(1−pA) / N)

  P(x̄ > or < x_ref) = P(z > or < [(x_ref − μx)/σx̄])
    with σx̄ = σx / sqrt(N)
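The proportion version of this calculation, as a minimal sketch with made-up numbers (p = 0.40, N = 100, p_ref = 0.50 are illustrative assumptions):

```python
import math
from statistics import NormalDist

# Hypothetical: population proportion p = 0.40, sample size N = 100.
# What is the chance a new sample shows a proportion above p_ref = 0.50?
p, N, p_ref = 0.40, 100, 0.50

se = math.sqrt(p * (1 - p) / N)   # sigma of the sampling distribution of phat
z = (p_ref - p) / se              # about 2.04
prob = 1 - NormalDist().cdf(z)    # P(phat > 0.50), about 0.02
print(z, prob)
```

`NormalDist().cdf` plays the role of the z-table here; a sample proportion more than ~2 standard errors above p is unlikely (about a 2% chance).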
And now for some things from the lectures of Dr.
Sklad, which I get to use when I substitute-teach
for him!!!
Learning Goals = Explain and Practice with Inferential Statistics
• Week 6, meeting 12: Explain and practice with inferential statistics
• Students:
  • Know what a sampling distribution and sampling variability look like
  • Can describe the general properties of a sampling distribution of a mean and of a proportion
  • Know what the Central Limit Theorem is
  • Know the difference between a population distribution and a sampling distribution
  • Are aware of the objectives of inferential statistics
  • Can apply a point estimation of a mean and of a variance
  • Know what a trimmed mean and an unbiased statistic are
  • Know which criterion is most important when choosing among unbiased statistics
  • Can calculate and interpret the confidence interval for a mean
  • Can calculate and interpret the confidence interval for a proportion
  • Know how to deal with the situation where the standard deviation of the population is unknown
    when calculating a CI
  • Know what a t-distribution and degrees of freedom are, and their most important properties
  • Can calculate the appropriate sample size for the estimation of a CI, including when a certain
    bound B is required
Sampling Distribution = distribution of the statistic in different samples,
e.g. the distribution of means

[Figure: a flat original distribution of the variable (values 1–5) alongside sampling
distributions of the mean for n = 2 and n = 4, and a histogram of means (n = 60); the
larger the sample size, the narrower and more peaked the sampling distribution.]
Example: sampling distribution of a proportion

Population: six persons answering never / sometimes / always. Persons A, B, C, D are
“not always” (value 0) and persons e, f are “always” (value 1), so the true population
proportion is p_always = 2/6 = 0.333.

All possible ordered samples of size n = 2 (30 outcomes):
  • p̂always = 0.0 for AB, BA, AC, CA, AD, DA, BC, CB, BD, DB, CD, DC → 12 outcomes
  • p̂always = 0.5 for Ae, eA, Af, fA, Be, eB, Bf, fB, Ce, eC, Cf, fC, De, eD, Df, fD → 16 outcomes
  • p̂always = 1.0 for ef, fe → 2 outcomes

Exact mean of the sampling distribution:
  (12·0.0 + 16·0.5 + 2·1.0)/30 = (0 + 8 + 2)/30 = 10/30 = 0.333
  = the true population proportion!

Exact standard deviation of the sampling distribution:
  sqrt[{12·(1/3)² + 16·(1/6)² + 2·(2/3)²}/30] = 0.298

CLT approximation of the standard deviation of the sample proportion (valid only if
n·p were > 10 — it isn't here!):
  sqrt[(0.333·0.667)/2] = 0.333 … not really that far off!
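The enumeration above can be reproduced directly. This sketch lists all 30 ordered samples of size 2 from the six-person population and computes the exact mean and standard deviation of the sampling distribution:

```python
import math
from itertools import permutations

# Population from the slide: A, B, C, D = "not always" (0); e, f = "always" (1)
population = [0, 0, 0, 0, 1, 1]

# All 30 ordered samples of size 2 (without replacement), each giving a phat
phats = [(a + b) / 2 for a, b in permutations(population, 2)]

mean_phat = sum(phats) / len(phats)
sd_phat = math.sqrt(sum((ph - mean_phat) ** 2 for ph in phats) / len(phats))

print(len(phats))           # 30 outcomes
print(round(mean_phat, 3))  # 0.333 = the true population proportion
print(round(sd_phat, 3))    # 0.298 = the exact value from the slide
```

The exact mean equals the population proportion (Rule 1), and the exact s.d. of 0.298 sits close to the CLT approximation of 0.333 even though n·p > 10 fails badly here.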
Example: z-probability for a sampling distribution vs. a sample (or population) distribution

A food company sells “18 ounce” boxes of cereal. Let x denote the actual amount of
cereal in a box. Suppose that x is normally distributed with µ = 18.03 ounces and
σ = 0.05 (or x̄ = 18.03, s = 0.05). [All boxes from one company are the population of
products from that company — or a sample of all boxes produced by any company.]

a) What proportion of the boxes will contain less than 18 ounces?

   P(x < 18) = P(z < (18 − 18.03)/0.05) = P(z < −0.60) = 0.2743

b) A case consists of 24 boxes of cereal. What is the probability that the mean
   amount of cereal (per box in a case) is less than 18 ounces?

   [One case is one sample of 24 boxes from the population of boxes; the mean
   amount per box (x̄) for that set of 24 is one observation from the distribution
   of all possible means of 24-box cases.]

   The central limit theorem states that the distribution of x̄ is normally
   distributed, so

   P(x̄ < 18) = P(z < (18 − 18.03)/(0.05/sqrt(24))) = P(z < −2.94) = 0.0016

   Note: if it were not stated above that the population distribution is normal,
   we could NOT use this formula … unless the sample size were larger, perhaps 44
   instead of 24. This is because the CLT states that the sampling distribution of
   the mean is normal if the population distribution is normal OR the sample size
   is large.
The 95% Confidence Interval for p

When n is large, a 95% confidence interval for p is

  ( p̂ − 1.96·sqrt[p̂(1−p̂)/n] , p̂ + 1.96·sqrt[p̂(1−p̂)/n] )

or, in general,

  p̂ ± (z critical value)·sqrt[p̂(1−p̂)/n]
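A minimal sketch of this formula with a hypothetical poll result (520 of 1000 respondents; the numbers are illustrative):

```python
import math
from statistics import NormalDist

# Hypothetical poll: 520 of 1000 respondents favor a proposal
phat, n = 0.52, 1000

z = NormalDist().inv_cdf(0.975)          # z critical value for 95% CL, ~1.96
se = math.sqrt(phat * (1 - phat) / n)    # standard error of phat
ci = (phat - z * se, phat + z * se)

print(round(ci[0], 3), round(ci[1], 3))  # 0.489 0.551
```

Note that this interval straddles 50%: with p near 0.5 and n = 1000 the poll cannot call the race, which is exactly the “close races need big samples” point from the earlier slide.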
Required Sample Size
(for a specified width B of the bound, and a rough guess p of the true population
proportion)

  n = p(1 − p)·(1.96/B)²

The bound on the error of estimation, B, associated with a 95% confidence interval is
(1.96)·(standard error of the statistic). More generally, the bound B associated with
a confidence interval is (z critical value)·(standard error of the statistic).

** Note that since 0.5·0.5 = 0.25, while 0.1·0.9 = 0.09, the n needed is lower when
one of the two options applies to a relatively small minority than when there is
a roughly “50-50” balance of the two options **
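The 0.25-vs-0.09 note can be made concrete. A sketch with an assumed bound B = 0.03 (3 percentage points); the function name is ours:

```python
import math

def required_n(p_guess, B, z=1.96):
    """n = p(1-p) * (z/B)^2, rounded up to the next whole subject."""
    return math.ceil(p_guess * (1 - p_guess) * (z / B) ** 2)

# Desired bound B = 0.03 at the 95% confidence level
print(required_n(0.5, 0.03))   # 1068  (worst case: "50-50" balance)
print(required_n(0.1, 0.03))   # 385   (much smaller n when one option is rare)
```

Using p = 0.5 as the guess is the conservative choice: p(1−p) is maximized there, so the resulting n is large enough for any true proportion.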
