Statistics
A unit of analysis is the entity that carries the information needed to realize the scope of a study.
Examples of units of analysis: an unemployed person (micro level), a voter (micro), an organization (meso), a country (macro).
The data in the figure represent a data matrix, which is a convenient and common way to organize
data, especially if collecting data in a spreadsheet. Each row of a data matrix corresponds to a
unique case (observational unit), and each column corresponds to a variable.
Types of variables
A variable is a characteristic that can take on a set of values for each observation in a sample or
population.
A variable is discrete if the set of possible values only allows separate values (normally
integers) such as 1, 2, 3, etc. An example is ‘age in years’.
A variable is continuous if the set of possible values is an infinite continuum on the real
number scale. An example is ‘CO2 emissions in tons’.
A variable is nominal if the set of categories it can take on have no logical ordering. An
example is ‘hair color’.
A variable is ordinal if the set of categories can be clearly ordered from highest to lowest.
An example is ‘life satisfaction’.
If two variables are not associated, then they are said to be independent. That is, two variables
are independent if there is no evident relationship between the two.
Population
The population is the complete set of units of analysis for which the spatial, temporal, or factual
selection criteria apply.
A population is called finite if the number of units of analysis can be counted. Examples:
Voters in Botswana, Students at a university, etc.
A population is called infinite if there is (potentially) an infinite number of observations.
As one of its major contributions, statistics uses data from a sample to make estimates and test
hypotheses about the characteristics of a population. This generalization from a (comparatively
small) sample to the population is called statistical inference.
A sample represents a subset of the cases and is often a small fraction of the population. Each
unit of analysis from the population for which we actually obtain data is called an observation.
All observations together make up the sample. For all observations in the sample, we register
certain quantities of interest to us in a dataset. These quantities of interest are called variables.
A parameter is a numerical summary of the population, e.g. the mean of male/female heights. A
statistic is the numerical summary of the sample data.
Wouldn’t it be best to include the entire population in the study instead of taking a sample? This
is called a census. Apart from being costly and time-consuming, a census (and sampling in
general) faces a few problems:
Non-response: If too many of the sampled people do not respond, the results may be
biased. More importantly, people sharing certain characteristics may be systematically
harder to reach.
Voluntary response: Occurs when only people who volunteer to participate are sampled.
These might be, for example, people who hold a different (stronger?) view on certain
issues, so the sample will not be representative of the population.
Convenience sample: Includes only individuals who are easily accessible. For instance, if
a political science student samples only her peers for a small study.
1. Simple Random Sample = in simple random samples, every member of a population has
an equal chance of being included in the sample.
2. Stratified Random Sample = take into account different layers within the population.
We ensure that our sample is representative of the population by also keeping the same
ratio of sample sizes as the ratio of the population strata. Stratified sampling is beneficial
to ensure that enough data from each category is collected to make a meaningful
inference, and it also allows us to draw conclusions within each of the groups.
3. Cluster Sample = focus on representative subsets of the population (first divide the
population into groups, then sample randomly from those groups).
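As a small illustration of the first two sampling methods (a Python sketch; the notes use R, and the population of voters here is entirely made up):

```python
import random

random.seed(42)

# Hypothetical population: 1000 voters, 300 students and 700 other citizens
population = [("student", i) for i in range(300)] + [("citizen", i) for i in range(700)]

# 1. Simple random sample: every member has an equal chance of inclusion.
srs = random.sample(population, 100)

# 2. Stratified random sample: sample within each stratum, keeping the same
#    ratio as in the population (30 students : 70 citizens in a sample of 100).
students = [p for p in population if p[0] == "student"]
citizens = [p for p in population if p[0] == "citizen"]
stratified = random.sample(students, 30) + random.sample(citizens, 70)

print(len(srs), len(stratified))                           # 100 100
print(sum(1 for p in stratified if p[0] == "student"))     # exactly 30 students
```

In the simple random sample the number of students varies by chance; the stratified sample fixes it at the population ratio by construction.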
The sampling error of a statistic equals the difference between the predicted value obtained by a
sample, and the population parameter. For instance, if a polling organization predicts a party to
win 33% of the votes on election day, but the party actually wins 36%, then the sampling error is
3%.
Frequency
The absolute frequency of a category i is the number ni of times this category occurs in the
dataset.
The relative frequency of a category is the proportion or percentage of observations falling into
that category. Mathematically, this is simply expressed as follows: f_i = n_i / n
The pie chart provides another graphical device for presenting relative frequency and percent
frequency distributions for categorical data.
In addition, we could also compute cumulative frequencies. Yet this only makes sense for
ordinal and numerical data, not for nominal data. The cumulative frequency F_i is the ‘running
total’ of frequencies.
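A minimal sketch of absolute, relative, and cumulative frequencies (Python; the ordinal variable here is made up):

```python
from collections import Counter

# Hypothetical ordinal variable, e.g. life satisfaction of 10 respondents
data = ["low", "low", "medium", "high", "medium",
        "medium", "high", "low", "medium", "high"]
order = ["low", "medium", "high"]

n = len(data)
abs_freq = Counter(data)                         # absolute frequencies n_i
rel_freq = {c: abs_freq[c] / n for c in order}   # relative frequencies f_i = n_i / n

# Cumulative frequencies F_i: the running total, meaningful because the
# categories are ordered from low to high.
cum, running = {}, 0
for c in order:
    running += abs_freq[c]
    cum[c] = running

print(dict(abs_freq))  # {'low': 3, 'medium': 4, 'high': 3}
print(rel_freq)        # {'low': 0.3, 'medium': 0.4, 'high': 0.3}
print(cum)             # {'low': 3, 'medium': 7, 'high': 10}
```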
Histogram
A graph which displays the classes on the x-axis and the frequencies on the y-axis. The
histograms tell us something about the distribution of the variable in question.
We are, on the one hand, interested in the modality of the distribution, i.e. how many
peaks the distribution has.
On the other hand we can also analyse the skewness of the distribution, i.e. how
asymmetric the distribution is.
MODE
We can say that the centre of the distribution is the most common class. Such a measure is called
the mode. The mode is defined as the most common value in a data set. The modal class of a
distribution is the class with the highest frequency.
MEDIAN
Alternatively, we can say that the centre of the distribution is the value that splits the distribution
in half. This value is called the median. As mentioned, the median is the point
at which 50% of the values lie below it. Therefore, it is also called the 50th percentile.
The median is much more robust to outliers than the mean. This means that single unusual values
do not influence the median greatly, while they do have an impact on the mean. For symmetrical
distributions, the mean and the median are the same. For skewed distributions, the mean lies
further in the direction of the skew than the median.
MEAN
The mean is defined as the sum across all observations divided by the number of observations: x̄ = (x_1 + x_2 + ... + x_n) / n
The mean is only appropriate for numerical data, and it is not sensible to compute the
mean for categorical data, even if they have numbers associated with every category. For
instance, when you ask respondents in a survey whether they fully agree (1), agree (2),
disagree (3), or fully disagree (4), you should not compute the mean at the end for the full
dataset.
The mean is sensitive to outliers in the data, i.e. single unusual observations may have a
strong influence on the mean.
The mean is the only single number for which the residuals (deviations from the estimate)
sum to zero.
The sum of squared residual is minimized at the mean.
This means that the mean is the best choice if you want to present the ‘typical value’ of
the data.
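These properties of the mean are easy to verify numerically (a Python sketch with made-up data):

```python
import statistics

# A small made-up dataset
data = [2, 4, 4, 6, 9]
mean = sum(data) / len(data)  # 25 / 5 = 5.0

# Property 1: the residuals (deviations from the mean) sum to zero.
residuals = [x - mean for x in data]
print(sum(residuals))  # 0.0

# Property 2: the sum of squared residuals is minimized at the mean.
def ssr(center):
    return sum((x - center) ** 2 for x in data)

print(ssr(mean) < ssr(mean + 0.5), ssr(mean) < ssr(mean - 0.5))  # True True

# Sensitivity to outliers: one unusual value shifts the mean, not the median.
with_outlier = data + [100]
print(statistics.mean(data), statistics.mean(with_outlier))      # mean jumps from 5 to ≈ 20.8
print(statistics.median(data), statistics.median(with_outlier))  # median barely moves: 4 -> 5.0
```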
Interquartile Range
The 25th, 50th, and 75th percentiles are named quartiles. The interquartile range (IQR),
calculated as the difference between the third (Q3) and first (Q1) quartiles (i.e., the 75th and 25th
percentiles), is another measure of the variability of a data set. The interquartile range gives us a
view of how the middle 50% of the data are spread. Higher values of the interquartile range
imply a larger spread of the data. A graphic representation of this interquartile range is called a
box plot.
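A quick computation of the quartiles and the IQR (Python sketch; note that different software packages use slightly different quantile methods, so results can differ at the margins):

```python
import statistics

# A small made-up numeric sample
data = [1, 2, 3, 4, 5, 6, 7, 8]

# statistics.quantiles with n=4 returns the three quartile cut points
q1, q2, q3 = statistics.quantiles(data, n=4)  # default 'exclusive' method
iqr = q3 - q1

print(q1, q2, q3)  # 2.25 4.5 6.75
print(iqr)         # 4.5 -> spread of the middle 50% of the data
```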
Variance
The variance is roughly the average squared deviation from the mean, or roughly the mean
squared error. For a sample: s² = ∑ (x_i − x̄)² / (n − 1)
Standard deviation
The variance is the average of the squared distances from the mean. Thus, it has a different scale
than the original data. When we talk about the square root of the variance, we are back at the
original scale of the data. This is called the standard deviation.
The standard deviation is the square root of the variance and has the same scale as the data.
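Both definitions can be checked directly (Python sketch; the data are made up):

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # mean = 5
mean = sum(data) / len(data)

# Population variance: the average squared deviation from the mean (divide by n)
pvar = sum((x - mean) ** 2 for x in data) / len(data)

# Sample variance divides by n - 1 instead
svar = sum((x - mean) ** 2 for x in data) / (len(data) - 1)

# The standard deviation (square root of the variance) is back on the data's scale
print(pvar, math.sqrt(pvar))  # 4.0 2.0
print(statistics.pvariance(data) == pvar, statistics.variance(data) == svar)  # True True
```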
Random Process = A random process is a situation in which we know the possible outcomes
that could happen, but we don’t know which particular outcome will happen for a given
observation.
PROBABILITY
The probability of an outcome is the proportion of times the outcome would occur if we
observed the random process an infinite number of times.
What the Law of Large Numbers actually tells us is that as we increase the sample size, our
certainty about our estimate increases. ⇒ Larger sample sizes are better!
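A short simulation of this idea (Python): the proportion of heads in repeated fair coin tosses settles near 0.5 as the number of tosses grows.

```python
import random

random.seed(1)

# Simulate tossing a fair coin n times; the proportion of heads approaches
# the true probability 0.5 as n increases (Law of Large Numbers).
for n in [100, 10_000, 1_000_000]:
    heads = sum(random.random() < 0.5 for _ in range(n))
    print(n, heads / n)
```

For small n the observed proportion can be noticeably off; for large n it is very close to 0.5.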
Sample Space = collection of all possible outcomes of a trial. A student has sat a pass/fail exam.
What is the sample space for the outcome of the exam? S = {P, F}
Disjoint (mutually exclusive) outcomes: Cannot happen at the same time. The outcome of a
single coin toss cannot be a head and a tail.
Non-disjoint outcomes: Can happen at the same time. A student can get an A in Stats and A in
Econ in the same semester.
Independence = Two processes are independent if knowing the outcome of one provides no
useful information about the outcome of the other.
MARGINAL PROBABILITY
The marginal probability is the probability of A occurring, p(A), and may be thought of as the
unconditional probability (i.e. it does not depend on another event). Marginal probabilities
correspond to the values in the margins of a table.
What is the probability that a (depressive) patient relapsed?
P(relapsed) = 48/72 ≈ 0.67
JOINT PROBABILITY
The joint probability is the probability of A and B occurring in the same event, p(A ∩ B). This is
also called the intersection of A and B. Joint probabilities correspond to the values inside the table.
CONDITIONAL PROBABILITY
P(A|B) = P(A ∩ B) / P(B)
What is the probability of relapsing, given that the patient takes desipramine?
P(relapsed | desipramine) = P(relapsed ∩ desipramine) / P(desipramine) = (10/72) / (24/72) = 10/24 ≈ 0.42
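The three kinds of probability from the relapse table can be computed directly from the counts given above (Python sketch):

```python
# Counts from the depression-relapse example (total n = 72)
n_total = 72
n_relapsed = 48                   # marginal count
n_desipramine = 24                # marginal count
n_relapsed_and_desipramine = 10   # joint count (inside the table)

p_relapsed = n_relapsed / n_total               # marginal probability
p_joint = n_relapsed_and_desipramine / n_total  # joint probability
p_cond = p_joint / (n_desipramine / n_total)    # conditional: P(A|B) = P(A ∩ B) / P(B)

print(round(p_relapsed, 2), round(p_cond, 2))   # 0.67 0.42
```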
RANDOM VARIABLES
A random variable can take on (at least two) different values, and each possible outcome is
associated with a probability that it occurs. Conventionally, we use a capital letter, like X, to
denote a random variable. The values of a random variable are denoted with a lower-case letter,
in this case x.
Thus, we can write out probabilities that x occurs, P (X = x). For example, rolling a die once can
take on the values X ∈ {1, 2, 3, 4, 5, 6} (this is the sample space). We want to know the
probability that an outcome smaller than or equal to x = 2 occurs: P(X ≤ 2) = 2/6 = 1/3.
EXPECTATION
The expected value of a random variable is the long-run average of all possible outcomes. In
other words, this is the mean. The expected values do not tell us anything about the short-run,
e.g. which number will occur in the next roll of the die, but about what we should expect on
average over many trials. Mathematically, we express the expected value as follows: E(X) = ∑ x_i · P(X = x_i)
In a game of cards, you win $1 if you draw a heart, $5 if you draw an ace (including the ace of
hearts), $10 if you draw the king of spades and nothing for any other card you draw. Write the
probability model for your winnings, and calculate your expected winning.
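A sketch of the probability model for this card game (Python; exact fractions make the bookkeeping transparent):

```python
from fractions import Fraction

# Winnings: $1 for a heart (12 non-ace hearts), $5 for any ace (4 aces,
# including the ace of hearts), $10 for the king of spades, $0 otherwise.
model = {
    1:  Fraction(12, 52),   # hearts excluding the ace of hearts
    5:  Fraction(4, 52),    # the four aces
    10: Fraction(1, 52),    # the king of spades
    0:  Fraction(35, 52),   # every other card
}

assert sum(model.values()) == 1  # the probabilities must total 1

# E(X) = sum of each outcome times its probability
expected = sum(x * p for x, p in model.items())
print(expected, float(expected))  # 21/26 ≈ 0.81
```

So in the long run you would win about 81 cents per draw on average.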
VARIABILITY
As in the case of a sample we discussed last time, we can also compute the long-run variance and
standard deviation for a random variable.
Note that σ and σ 2 are for the very long-run or the population, while s and s2 are for a sample.
For this reason, σ 2 is divided by n instead of n – 1
A probability distribution of a discrete random variable lists all possible outcomes and their
respective probabilities. Rules for probability distributions: the outcomes listed must be disjoint,
each probability must be between 0 and 1, and the probabilities must sum to 1.
Continuous variables have an infinite number of possible outcomes. Therefore, the probability of
any single exact value is 0, and we cannot use these values directly. Instead, we need to define
intervals of numbers, and assign probabilities between 0 and 1 to these intervals.
The resulting graph of such a distribution is a smooth, continuous curve. The fraction of the area
under the curve of an interval represents the probability that the variable takes on a value
contained in the interval.
Since height is a continuous numerical variable, it would be better to represent the distribution as
a continuous smooth curve instead of bins. We call this curve the probability density function.
NORMAL DISTRIBUTION
The normal distribution is symmetric, bell shaped, and characterized by its mean µ and the
standard deviation σ. No matter which normal distribution we look at, the probabilities of falling
within 1, 2, 3, x standard deviations of µ remain the same.
Z-Scores
The z-score for a value x of a random variable X is the distance of that value from µ expressed in
standard deviations.
These z-scores are also called standardized scores, since they rescale distances from the mean of
all normal distributions to that of the standard normal distribution N(0, 1). Thus, z-scores make
comparison between normal distributions possible. Z-scores can also be computed for all other
(continuous) distributions, yet we cannot use them to calculate percentiles and to compare
distributions.
SAT and ACT scores (two standardized tests used for college admission in the US) are both
distributed nearly normally: SAT scores follow N(1500, 300) and ACT scores follow N(21, 5).
We want to compare two students: Pam, who earned an 1800 on her SAT, and Jim, who scored a
24 on his ACT.
Let’s standardize both of these scores:
Pam: z = (1800 − 1500) / 300 = 300/300 = 1
Jim: z = (24 − 21) / 5 = 3/5 = 0.6
Since these scores are standardized, we can now compare them and say that Pam’s score is 1
standard deviation above the mean, while Jim’s is only 0.6 standard deviations above the mean.
Percentiles
So far, we looked at z-scores to evaluate what percentage of observations lies above or below a
given value. But what if we want to know the probability of observations between two values?
For instance, we might want to know the percentage of men not further away than 15 cm
of the average height of 178 cm. Thus, we look at all men between 163 and 193 cm.
First, we can look at the probability of men who are smaller than 193 cm, thus excluding all
those who are larger than that. The height of men is normally distributed with N (178, 10.1).
Let’s calculate the z-score and then use R to calculate probabilities:
z = (193 − 178) / 10.1 = 1.485149
Looking at the table, we find 0.9312478. Thus, about 93.12% of men are smaller than 193 cm.
From these 93.12%, we still need to subtract those observations that are smaller than 163 cm.
Again, we can calculate z-scores and percentiles:
z = (163 − 178) / 10.1 = −1.485149
Looking at the table: 0.06875218. Thus, about 6.88% of men are smaller than 163 cm.
Thus, about 86.24% of men are between 163 and 193 cm tall.
Conversely, if we know the distribution, we can also ask what the cut-off point for a certain
percentage is. For instance, we could ask at which height 10% of observations are smaller. To do
this, we need to look up the appropriate level in a z-score table, and then plug the value into the
z-score formula.
Thus, we need to plug the value −1.28 into our z-value formula.
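Instead of a z-table, Python's `statistics.NormalDist` can reproduce both directions of this calculation (the notes do the same in R):

```python
from statistics import NormalDist

heights = NormalDist(mu=178, sigma=10.1)  # men's height example from above

# P(163 < X < 193): cumulative probability up to 193 minus that up to 163
p_below_193 = heights.cdf(193)   # ≈ 0.9312
p_below_163 = heights.cdf(163)   # ≈ 0.0688
print(round(p_below_193 - p_below_163, 4))  # 0.8625

# Inverse question: below which height do 10% of observations fall?
print(round(heights.inv_cdf(0.10), 1))  # ≈ 165.1 (corresponds to z ≈ -1.28)
```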
No matter which normal distribution we look at, the following properties apply to all of them:
about 68% falls within 1 SD of the mean,
about 95% falls within 2 SD of the mean,
about 99.7% falls within 3 SD of the mean.
It is possible for observations to fall 4, 5, or more standard deviations away from the
mean, but these occurrences are very rare if the data are nearly normal.
The normal distribution is very important because many other distributions can be approximated
by it.
PARAMETER ESTIMATION
We are often interested in population parameters. Since complete populations are difficult (or
impossible) to collect data on, we use sample statistics as point estimates for the unknown
population parameters of interest. Sample statistics vary from sample to sample. Quantifying
how sample statistics vary provides a way to estimate the margin of error associated with our
point estimate.
Example: List of random numbers: 59, 121, 88, 46, 58, 72, 82, 81, 5, 10
In our case, the sample mean is: x¯ = (8 + 6 + 10 + 4 + 5 + 3 + 5 + 6 + 6 + 6)/10 = 5.9
This is our population; if we repeatedly sample from it and compute the mean of each sample, we
will find slightly different means each time. What we see here, i.e. the variability of the sample
means, is called a sampling distribution.
A sampling distribution of a statistic is the probability distribution that specifies probabilities for
the possible values the statistic can take on.
In a sample from a numeric variable (such as in our example), the sampling distribution
will be normally distributed with the mean of the population parameter, µ.
The standard error describes the variability of the sampling distribution, and depends on
the population standard deviation, σ, and the sample size, n.
The sampling distribution represents the distribution of the point estimates based on
samples of a fixed size from a certain population. It is useful to think of a particular point
estimate as being drawn from such a distribution.
STANDARD ERROR
The standard deviation of the sampling distribution of x̄ is called the standard error of x̄ and is
denoted by σx̄. The standard error describes the typical error or uncertainty associated with the
estimate. Given n independent observations from a population with standard deviation σ, the
standard error of the sample mean is equal to: SE = σ / √n
Note that the SE decreases as n increases ⇒ we expect more consistent sample means as n
increases, thus the variability decreases. Also note that we usually do not know σ, hence we use
the sample standard deviation s instead.
The Central Limit Theorem for random samples states that the sampling distribution of the
sample mean is approximately a normal distribution with mean µ and standard error σ / √n.
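A simulation makes this claim concrete (Python sketch; the uniform population is made up):

```python
import random
import statistics

random.seed(0)

# Made-up population: uniform integers 0..100, so µ = 50 and σ ≈ 29.15
population = list(range(101))
mu = statistics.mean(population)
sigma = statistics.pstdev(population)

# Draw many samples of size n and record each sample mean
n = 25
means = [statistics.mean(random.choices(population, k=n)) for _ in range(5000)]

# The sampling distribution is centred on µ, with spread ≈ σ/√n (the SE)
print(round(statistics.mean(means), 1))   # ≈ 50
print(round(statistics.stdev(means), 2))  # ≈ σ/√25 ≈ 5.83
```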
Certain conditions must be met for the CLT to apply: the sampled observations must be
independent (e.g. a random sample of less than 10% of the population), and either the sample
size is large (n ≥ 30 as a rule of thumb) or the population distribution is not strongly skewed.
CONFIDENCE INTERVALS
A confidence interval for a parameter is an interval of numbers within which the parameter is
believed to fall, always of the form point estimate ± margin of error (for a 95% interval of the
mean: x̄ ± 1.96 × SE). The probability that this method produces an interval that contains the
parameter is called the confidence level.
Example: A random sample of 50 college students were asked how many exclusive relationships
they have been in so far. This sample yielded a mean of 3.2 and a standard deviation of 1.74.
Estimate the true average number of exclusive relationships using this sample.
We are 95% confident that college students on average have been in between 2.7 and 3.7
exclusive relationships.
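The interval reported above can be reproduced step by step (Python):

```python
import math

# Exclusive-relationships example: n = 50, sample mean 3.2, sample SD 1.74
n, xbar, s = 50, 3.2, 1.74

se = s / math.sqrt(n)    # standard error of the mean
me = 1.96 * se           # margin of error at the 95% level (z* = 1.96)
ci = (xbar - me, xbar + me)

print(round(se, 3))                      # 0.246
print(round(ci[0], 1), round(ci[1], 1))  # 2.7 3.7
```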
HYPOTHESIS TESTING
We conduct a hypothesis test under the assumption that the null hypothesis is true. If the test
results suggest that the data do not provide convincing evidence for the alternative hypothesis,
we stick with the null hypothesis. If they do, then we reject the null hypothesis in favor of the
alternative.
Test statistic = first construct the distribution assumed under the Null, and then compute a z-
value of the observed value for this distribution.
P-VALUE
The p-value is the probability that the test statistic equals the observed value or an even more
extreme value in the direction predicted by HA. It is calculated presuming that H0 is true.
If the p-value is low (lower than the significance level, α, which is usually 5%) we say
that it would be very unlikely to observe the data if the null hypothesis were true, and
hence reject H0.
If the p-value is high (higher than α) we say that it is likely to observe the data even if the
null hypothesis were true, and hence do not reject H0.
The most commonly used significance level in the social sciences is the 5% significance level.
⇒ For a one-sided test, the p-value at µ + 1.64 × SE under the Null is exactly 0.05 (z = 1.64 for
the standard normal); we reject H0 for values greater than µ + 1.64 × SE.
⇒ For a two-sided test, the p-value at µ ± 1.96 × SE under the Null is exactly 0.05 (z = ±1.96 for
the standard normal); we do not reject H0 for values between µ − 1.96 × SE and µ + 1.96 × SE.
Cut-Off Point
At which value is the cut-off point where we would reject the H0 (assuming the SE remains the
same)? Differently stated: At which value would the p-value be exactly 0.05?
Hence, we can plug in the values 8 for the mean under Null, 0.5 for the SE, and 1.64 for the z-
value at which we know that 95% of observations are expected to be smaller in a normal
distribution.
Therefore, given the (estimated) SE, we would reject the Null at the 5% significance level for all
values equal to or greater than 8.82. That is the point where the p-value equals 0.05.
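The cut-off arithmetic in a couple of lines (Python; 8, 0.5, and 1.64 are the values from the example):

```python
# One-sided test at the 5% level: reject H0 for sample means at or above
# null mean + 1.64 standard errors.
null_mean, se, z_crit = 8, 0.5, 1.64

cutoff = null_mean + z_crit * se
print(round(cutoff, 2))  # 8.82
```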
Two-Tailed Test
If the research question was “Do the data provide convincing evidence that the mean of our
sample is different from our H0 (instead of larger or smaller)?”, the alternative hypothesis would
be different.
Then, when we observe a value x from a sample, we cannot only look at one side of the
distribution to see how likely the value is. We must also consider the other tail
For the p-value, this means that if we want to test at the 5% significance level, we must consider
both tails.
In practice, this means rejecting the Null if the sample falls in either of the tails, each of
which with 2.5% probability.
This works because the normal distribution is symmetric.
In R, we can therefore simply multiply the p-value from before by two.
T-DISTRIBUTION
When we have a small sample size, it is better to be cautious and to use the more robust t
distribution instead of the normal distribution.
When working with small samples, and the population standard deviation is unknown (almost
always), the uncertainty of the standard error estimate is addressed by using a new distribution:
the t distribution.
This distribution also has a bell shape, but its tails are thicker than the normal model’s.
Therefore, observations are more likely to fall beyond two SDs from the mean than under
the normal distribution.
These extra thick tails are helpful for resolving our problem with a less reliable estimate of
the standard error ⇒ we need larger t-values (not z-values any more) to reject H0.
Degrees of Freedom
The degrees of freedom (df) describe the shape of the t distribution. The larger the degrees of
freedom, the more closely the distribution approximates the normal model. If a sample has n
observations and we examine its mean, then we use the t distribution with df = n − 1. At around a
sample size of 30 (some say 50), the t distribution is almost equal to the normal distribution.
Finding the test statistic is very similar to the standard normal case. Instead of z-values, we now
compute t-values with df = n − 1 for the sample mean.
Once we know our t-value, we can again obtain p-values, either using tables or using software
such as R.
Confidence Interval
For a study, 29 people suffering from anorexia are treated with a new therapy. The study registers
the change in weight for each person receiving the therapy over two months. At the end of the
treatment period, the mean weight change was 3.007 kilos, with a standard deviation of 7.309. Is
the treatment effective?
First, define H0 and HA, then begin by calculating the SE for the study:
SE = 7.309 / √29 = 1.357
Next, get the t-value and the p-value (in R): t = (3.007 − 0) / 1.357 = 2.22
The p-value is 0.017. Thus, there is only an about 1.7% chance to observe the value of 3.007 if
the Null is correct. The value is significant at the 5% significance level.
Since we can’t find exact p-values using tables, for our hypothesis test we need to find the cut-
off value at which we start rejecting the Null. In this case, we are looking at a one-sided
hypothesis test (the new therapy has a positive effect), and we want to test at the 5% significance
level.
Thus, at t-values larger than 1.699, we reject H0. Since the value we found is 2.22, H0 is rejected
(at the 5% significance level).
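The whole test can be retraced numerically (Python sketch; the critical value 1.699 is the one quoted in the text):

```python
import math

# Anorexia-therapy example: n = 29, mean weight change 3.007 kg, SD 7.309
n, xbar, s = 29, 3.007, 7.309

se = s / math.sqrt(n)   # standard error of the mean
t = (xbar - 0) / se     # test statistic against H0: µ = 0
df = n - 1              # 28 degrees of freedom

t_crit = 1.699          # one-sided 5% critical value from the t-table used above

print(round(se, 3), round(t, 2))  # 1.357 2.22
print(t > t_crit)                 # True -> reject H0
```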
Finally, let’s construct a 95% confidence interval. For df = 28, the t* value is 2.048, so the
interval is 3.007 ± 2.048 × 1.357 ≈ (0.23, 5.79).
We are going to test to see if there is a difference between the average prices of 0.99 and 1 carat
diamonds. In order to be able to compare equivalent units, we divide the prices of 0.99 carat
diamonds by 99 and 1 carat diamonds by 100, and compare the average point prices.
Parameter of interest: Average difference between the point prices of all 0.99 carat and 1 carat
diamonds.
Point estimate: Average difference between the point prices of sampled 0.99 carat and 1 carat
diamonds.
Test statistic for inference on the difference of two small sample means is as follows:
T_df = (point estimate − null value) / SE
where the standard error of the difference between two means is SE = √(s1²/n1 + s2²/n2), and
the degrees of freedom are df = min(n1 − 1, n2 − 1). With the test statistic and df, we can
compute a p-value for our one-sided test comparing two means. The p-value is small at 0.01,
hence there is only about a 1% chance of drawing our samples if the Null (no difference between
the two diamond classes) were correct. We therefore reject H0: there is enough evidence to
suggest that 0.99 carat diamonds cost less than 1 carat diamonds.
INFERENCE FOR CATEGORICAL DATA
Approval Rating
Note: So far we used the sample mean x̄ (the estimate) vs. the population mean µ (the parameter).
Almost all other estimates are indicated by a hat over the symbol.
We can answer this research question about the proportion of Arstotzkans approving of their
government by using a confidence interval, which we know is always of the form
point estimate ± margin of error
And we also know that ME = critical value × standard error of the point estimate. So what is the
SE of a sample proportion? It is SE = √(p(1 − p)/n).
Note: If p is unknown (as in most cases), we use p̂ in the calculation of the standard error.
But of course, this is true only under certain conditions: Independent observations (of less than
10% of the population). At least 10 successes and 10 failures.
We are given that n = 670, p̂ = 0.85, and we also just learned the standard error of the sample
proportion: SE = √(0.85 × 0.15 / 670) ≈ 0.0138. Thus, we can construct confidence intervals:
The survey found that 571 out of 670 (85%) respondents have a favorable view of their
government. Do these data provide convincing evidence that more than 80% of Arstotzkans
approve of their government?
Since the p-value is low, we reject H0. The data provide convincing evidence that more than
80% of Arstotzkans value the work of their government. Important: For this hypothesis test we
need at least 10 expected successes and failures in the sample (670 × 0.2 = 134 and 670 × 0.8 =
536 ⇒ OK).
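The hypothesis test for the approval example, retraced in Python (the notes would do this in R):

```python
import math
from statistics import NormalDist

# Approval example: 571 of 670 respondents (p-hat = 0.85), H0: p = 0.80
n, p_hat, p0 = 670, 571 / 670, 0.80

# Success-failure condition under H0: both expected counts are at least 10
assert n * p0 >= 10 and n * (1 - p0) >= 10  # 536 and 134 -> OK

# Under H0 the standard error uses the null proportion p0
se = math.sqrt(p0 * (1 - p0) / n)
z = (p_hat - p0) / se
p_value = 1 - NormalDist().cdf(z)  # one-sided: is p greater than 0.80?

print(round(z, 2), round(p_value, 4))  # z ≈ 3.38, p ≈ 0.0004
```

Since the p-value is far below 0.05, H0 is rejected, matching the conclusion above.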
DIFFERENCE OF PROPORTIONS
A survey asked people from the general population and then specifically students whether they
were bothered by global warming. Here is the result of the survey:
p_students − p_citizens
Point estimate: Difference between the proportions of sampled students and sampled ‘normal’
citizens who are bothered a great deal by global warming.
p̂_students − p̂_citizens
Construct a 95% confidence interval for the difference between the proportions of students and
citizens who are bothered a great deal by global warming
Do these data suggest that the proportion of all students who are bothered a great deal by global
warming differs from the proportion of all citizens who do?
The question now is: what is p̂ here? Remember that with one proportion we used the expected
proportion from H0 and HA. We don’t have this here; instead we use the pooled proportion of
successes in the sample. This is like saying that the two proportions are essentially the same
under the Null.
Type 2 error is failing to reject H0 when you should have (i.e. when HA is true), and the
probability of doing so is β (a little more complicated to calculate).
Power of a test is the probability of correctly rejecting H0, and the probability of doing so
is 1 − β
In hypothesis testing, we want to keep α and β low, but there are inherent trade-offs. Ways to
increase the power of a test:
1. Increase the sample size, which will decrease the standard error.
2. Decrease the standard deviation of the sample, which essentially has the same effect as
increasing the sample size (it will decrease the standard error). With a smaller s we have
a better chance of distinguishing the null value from the observed point estimate. This is
difficult to ensure but cautious measurement process and limiting the population so that it
is more homogenous may help.
3. Increase α, which will make it more likely to reject H0 (but note that this has the side
effect of increasing the Type I error rate).
4. Consider a larger effect size: larger true differences from the null value are easier to
detect, though this means changing the hypothesis.
Real differences between the point estimate and null value are easier to detect with larger
samples. However, very large samples will result in statistical significance even for tiny
differences between the sample mean and the null value (effect size), even when the difference is
not practically significant. This is especially important to research: if we conduct a study, we
want to focus on finding meaningful results (we want observed differences to be real, but also
large enough to matter). The role of a statistician is not just in the analysis of data, but also in
planning and design of a study.
We say there is an association between two variables, if the distribution of the dependent
variable changes in some way as the value of the independent variable changes.
To identify the explanatory variable in a pair of variables, carefully investigate which of the two
is suspected of affecting the other. In the social sciences, we usually do this by using (or
developing) theory.
Thus, we derive a hypothesis: explanatory variable → might affect → response variable
Labeling variables as explanatory and response does not guarantee the relationship between the
two is actually causal, even if there is an association identified between the two variables.
Observational study: Researchers collect data in a way that does not directly interfere with how
the data arise, i.e. they merely “observe", and can only establish an association between the
explanatory and response variables.
Scatterplot
Scatterplots are useful for visualizing the relationship between two numerical variables. Do life
expectancy and GDP per capita appear to be associated or independent? They appear to be
linearly and positively associated: as GDP increases, life expectancy also increases.
The scatterplots exhibit a tendency of the data that the two variables are associated. For instance,
nations with higher GDP tend to have higher life expectancies. The correlation describes this
trend numerically in a single number.
Correlation, which always takes values between −1 and 1, describes the strength of the linear
relationship between two variables. We denote the correlation by R. The correlation for the
observations (x1, y1), (x2, y2), ..., (xn, yn) has the following formula:
R = (1 / (n − 1)) ∑ [(x_i − x̄) / s_x] × [(y_i − ȳ) / s_y]
For each observation, both the xi and the yi values are standardized and then multiplied. So if
observations tend to increase/decrease together, they will have (on average) standardized values
for x and y with the same sign. As a consequence, the sum of these multiplied standardized
values will increase, and the correlation will be relatively large.
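The formula can be applied directly (Python sketch with made-up GDP/life-expectancy-style data):

```python
import statistics

# Hypothetical paired data, e.g. GDP per capita (x) and life expectancy (y)
x = [10, 20, 30, 40, 50]
y = [60, 65, 72, 75, 83]

n = len(x)
mx, my = statistics.mean(x), statistics.mean(y)
sx, sy = statistics.stdev(x), statistics.stdev(y)

# R = (1 / (n - 1)) * sum of the products of the standardized values
r = sum(((xi - mx) / sx) * ((yi - my) / sy) for xi, yi in zip(x, y)) / (n - 1)

print(round(r, 3))  # 0.993 -> strong positive linear association
```

Because x and y rise together, the standardized products are almost all positive, pushing R close to 1.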
If they do not increase/decrease together, the standardized values for x and y are sometimes
positive, sometimes negative (by random chance), and the sums will tend to cancel out ⇒ the
correlation will be close to zero.
The linear regression model is written as: y_i = β0 + β1 × x_i + e_i
where x_i and y_i are the values of the independent and the dependent variable for
observation i = 1, ..., n,
e_i is the error term for observation i = 1, ..., n (also called the residual),
β0 and β1 are the model coefficients we want to estimate.
From this formula we want to estimate the best fit for β0 and β1, given the data. These estimates
are then called ^β 0(beta-zero hat) and ^β 1 (beta-one hat)
Fitted value: Also called predicted value. The expected value for an observation i at its given
value xi.
To obtain the best linear unbiased estimator (BLUE), we need to find the line with the smallest
sum of squared residuals (SSR).
Mathematically, we want to minimize: ∑ e_i² (summing over i = 1, ..., n)
Intercept
The intercept is where the regression line intersects the y-axis. The calculation of the intercept
uses the fact that the regression line always passes through (x̄, ȳ).
Since there are no states in the dataset with no HS graduates, the intercept is of no interest, not
very useful, and also not reliable since the predicted value of the intercept is so far from the bulk
of the data.
Using the linear model to obtain the value of the response variable for a given value of the
explanatory variable is called prediction. We simply plug in the value of interest for x in the
linear model equation. There will be some uncertainty associated with the predicted value.
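A least-squares fit and a prediction from scratch (Python sketch; the data are made up, not the poverty data from the notes):

```python
import statistics

# Hypothetical data: % HS graduates (x) and % in poverty (y) for five states
x = [80, 85, 88, 90, 92]
y = [16, 14, 12, 11, 10]

mx, my = statistics.mean(x), statistics.mean(y)

# Least-squares estimates:
#   b1 = sum((x - mx)(y - my)) / sum((x - mx)^2),   b0 = my - b1 * mx
b1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
     sum((xi - mx) ** 2 for xi in x)
b0 = my - b1 * mx  # the fitted line always passes through (x-bar, y-bar)

# Prediction: plug a value of x into the fitted line
y_hat = b0 + b1 * 89
print(round(b1, 3), round(b0, 2), round(y_hat, 2))  # -0.511 57.09 11.58
```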
Model Assumptions
GOODNESS OF FIT: R2
The strength of the fit of a linear model is most commonly evaluated using R². It tells us what
percent of the variability in the response variable is explained by the model. The remainder of the
variability is explained by variables not included in the model or by inherent randomness in the
data. For the model we’ve been working with, R² = (−0.62)² ≈ 0.38. Interpretation: 38% of the
variability in the % of residents living in poverty among the 51 states is explained by the model.
How well does our model explain the total variation of the dependent variable? After running a
regression, there is a part of the variation that is explained, and the rest (the residuals).
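This decomposition can be sketched as R² = 1 − SSR/SST; the observed and fitted values below are hypothetical:

```python
# Hypothetical observed and fitted values for a simple regression.
y = [2.1, 2.9, 3.2, 4.8, 5.1]
fitted = [2.04, 2.83, 3.62, 4.41, 5.20]

y_bar = sum(y) / len(y)
sst = sum((yi - y_bar) ** 2 for yi in y)                # total variation
ssr = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # unexplained (residual) variation
r_squared = 1 - ssr / sst                               # share of variation explained
```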
INFERENCE FOR LINEAR REGRESSION
1. An additional 10 points in the biological twin’s IQ is associated with an additional 9 points in
the foster twin’s IQ, on average.
2. Roughly 78% of the variability in the foster twins’ IQs is explained by the model.
H0: The intelligence of the biological twin does not predict the foster twin’s IQ (nurture).
HA: The intelligence of the biological twin does predict the foster twin’s IQ (nature).
If H0 is rejected, we conclude that a statistically significant relationship exists between the
two variables. However, if H0 cannot be rejected, we have insufficient evidence to conclude
that a significant relationship exists.
We always use a t-test in inference for regression. We use the usual (by now known) formula to
obtain the t-statistic: t = (point estimate − null value) / SE = (β̂1 − 0) / SE(β̂1).
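A sketch of this test statistic for the slope (null value β1 = 0), using the standard formula SE(β̂1) = √(SSR/(n−2)) / √(∑(xi−x̄)²) and hypothetical data:

```python
import math

# Hypothetical data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 2.9, 3.2, 4.8, 5.1]
n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares fit.
sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
b0 = y_bar - b1 * x_bar

# Standard error of the slope from the residuals.
ssr = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
se_b1 = math.sqrt(ssr / (n - 2)) / math.sqrt(sxx)

t_stat = (b1 - 0) / se_b1  # compare to a t distribution with n - 2 df
```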
Maybe the most important assumption of our OLS models is that the relationship between y and
x is accurately described by a line. Using this assumption allows us to:
1) Characterize the relationship between x and y with a single number.
2) Easily interpret the marginal effect of x (the effect of a one-unit increase in x).
However, if the strong assumption of linearity is wrong, our results are wrong and meaningless.
Thus, we need to carefully assess the linearity assumption. Check using a scatterplot of the data,
or a residuals plot with the independent variable on the x-axis and the residuals on the y-axis.
The next assumption of the OLS model is that the residuals are (nearly) normally distributed.
Thus, at every point of x the residuals should follow a normal distribution. This condition may
not be satisfied when there are unusual observations that don’t follow the trend of the rest of the
data. Check using a histogram or normal probability plot of residuals.
The variability of points around the least squares line
should be roughly constant. This implies that the
variability of residuals around the 0 line should be
roughly constant as well. This is also called homoscedasticity.
Check using a residuals plot with the independent
variable on the x-axis.
Sometimes we include more than a single regressor (independent variables) in the model. This is
called a multiple linear regression (often simply called multiple regression).
Weights of Books
Slope of volume: All else held constant, books that are 1 more cubic centimeter in
volume tend to weigh about 0.72 grams more.
Slope of cover: All else held constant, the model predicts that paperback books weigh
184 grams less than hardcover books.
Intercept: Hardcover books with no volume are expected on average to weigh 198 grams.
Obviously, the intercept does not make sense in context. It only serves to adjust the
height of the line.
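The interpretations above correspond to plugging values into the fitted equation weight = 198 + 0.72·volume − 184·paperback (coefficients rounded as reported above; the 1000 cm³ volume below is hypothetical):

```python
# Fitted books model from the notes (coefficients rounded as reported):
#   weight = 198 + 0.72 * volume - 184 * paperback
def predicted_weight(volume_cm3, paperback):
    # paperback is a dummy: 1 for paperback, 0 for hardcover
    return 198 + 0.72 * volume_cm3 - 184 * paperback

# A hypothetical 1000 cm^3 book under both cover types:
hardcover_w = predicted_weight(1000, 0)
paperback_w = predicted_weight(1000, 1)  # exactly 184 grams lighter
```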
Dummy Variables
The cover type variable has only two categories (hard cover or paperback). We call such a
variable a dummy variable. Dummy variables are categorical variables in binary form (entries
are either 0 or 1). Thus, in our example, one cover type will get the value 0, the other the value 1.
The benefit of such zero-one variables is that they lead to regression models with an easy
interpretation. Advice: When creating your own dummy variables, name your variable after the
1-category (e.g. “female”). This way it is clear later on which category is which!
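A minimal sketch of creating such a dummy, named after the 1-category as advised:

```python
# Two-category variable coded as a 0/1 dummy, named after the 1-category
# ("paperback"), so the coding stays self-explanatory.
cover = ["hardcover", "paperback", "paperback", "hardcover"]
paperback = [1 if c == "paperback" else 0 for c in cover]
```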
If we have a variable with multiple categories, then we get multiple estimates that change the
intercept. We need to include one dummy variable for every category (save the base category),
i.e. a variable that takes the value 1 if the observation is in that category, and 0 otherwise. The
intercept for the base category is the ‘normal’ intercept. Including an extra dummy for the base
category as well leads to the dummy variable trap (the dummies would be perfectly collinear).
Example: If there are four political parties (here labeled A, B, C, D, with A as the base
category) and we create dummies for a regression for them, they look as follows:
Party   dummyB   dummyC   dummyD
A       0        0        0
B       1        0        0
C       0        1        0
D       0        0        1
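Programmatically, this means one 0/1 column per non-base category; a sketch with hypothetical party labels:

```python
# One dummy per category except the base (avoiding the dummy variable trap).
# Party labels are hypothetical.
parties = ["A", "B", "C", "D", "B", "A"]
categories = ["B", "C", "D"]  # base category "A" gets no dummy of its own

dummies = {c: [1 if p == c else 0 for p in parties] for c in categories}
# An observation from the base party has 0 in every dummy column.
```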
Two predictor variables are said to be collinear when they are correlated, and this collinearity
complicates model estimation. Predictors are also called explanatory or independent variables.
Ideally, they would be independent of each other.
Redundancy: Adding such variables brings nothing to the table. Instead, we prefer the
simplest model, i.e. a parsimonious model.
Biased estimates: If the predictors are correlated, we don’t know what part of the variance in y is
caused by x1 or by x2. While it’s impossible to avoid collinearity arising in observational data,
experiments are usually designed to prevent correlation among predictors: e.g., traffic accidents (y)
= age (x1) + legal drinking (x2).
Maybe compare the frequency of accidents for people just below/above the legal drinking age?
Adjusted R2
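A quick sketch of the adjusted R² statistic this heading names, using the standard formula; the R² = 0.38 and n = 51 come from the model earlier in the notes, while the second model’s R² is hypothetical:

```python
# Standard adjusted R^2 formula: penalizes the number of predictors k.
def adjusted_r2(r2, n, k):
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Model from earlier in the notes: R^2 = 0.38, n = 51 states, k = 1 predictor.
a = adjusted_r2(0.38, 51, 1)
# Hypothetical second predictor nudges R^2 up only slightly, to 0.385 ...
b = adjusted_r2(0.385, 51, 2)
# ... yet adjusted R^2 goes DOWN, flagging the extra variable as not worth it.
```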
So, if adding variables to the model might cause problems, why should we add variables to our
models (and not test them one by one)? There are three reasons why we might want to do this:
1. We want to maximize predictive power
2. We have more than one hypothesis of interest.
3. There exists a variable that influences y and is correlated with x (omitted variable bias).
e.g., the effect of education on income and the role played by motivation in both
Thus, the error term ei captures all the variance of y that is still unexplained by the model. This
unexplained variance has two parts:
Some of the variance can be explained by other (measurable) factors.
The rest is due to inherent randomness in the data.
However, if any of the influential variables (that are captured in the error term) is correlated with
our independent variable of interest, then the regression result will be systematically biased! We
need a theoretically driven plan for which variables to include (assumptions): causal graphs (DAGs).