Hypothesis Test Errors
• Here, ‘density’, ‘block’, and ‘fertilizer’ are listed as categorical variables with the
number of observations at each level (e.g. 48 observations at density 1 and 48
observations at density 2).
• ‘Yield’ should be a quantitative variable with a numeric summary (minimum,
median, mean, maximum).
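As a rough illustration of the summary described above, the sketch below builds a synthetic stand-in for crop.data and calls summary() on it. The yield values are randomly generated, and the number of block levels (4) and fertilizer levels (3) are assumptions made for the example; only the 48/48 split across density comes from the text.

```r
# Synthetic stand-in for crop.data; all yield values are made up for illustration
set.seed(1)
crop.data <- data.frame(
  density    = rep(c(1, 2), each = 48),   # 48 observations per density level
  block      = rep(1:4, times = 24),      # assumed: 4 blocks
  fertilizer = rep(1:3, times = 32),      # assumed: 3 fertilizer types
  yield      = rnorm(96, mean = 177, sd = 0.6)  # arbitrary mean and spread
)

# Convert the grouping columns to factors so that summary() reports
# the number of observations at each level instead of numeric quantiles
crop.data$density    <- factor(crop.data$density)
crop.data$block      <- factor(crop.data$block)
crop.data$fertilizer <- factor(crop.data$fertilizer)

# Factors show per-level counts; yield shows min, quartiles, median, mean, max
summary(crop.data)
```

If a grouping column is left as numeric, summary() prints quantiles for it, which is a quick sign that it still needs converting with factor().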
Performing ANOVA
one.way <- aov(yield ~ fertilizer, data = crop.data)
summary(one.way)
The model summary first lists the independent variables being tested in the model (in this case we have only
one, ‘fertilizer’) and the model residuals (‘Residuals’).
All of the variation that is not explained by the independent variables is called residual variance.
• The rest of the values in the output table describe the independent variable and the residuals:
• The Df column displays the degrees of freedom for the independent variable (the number of
levels in the variable minus 1), and the degrees of freedom for the residuals (the total number
of observations minus one, minus the degrees of freedom used by the independent variables).
• The Sum Sq column displays the sum of squares: for the independent variable, the total variation
between the group means and the overall mean; for the residuals, the variation left within the groups.
• The Mean Sq column is the mean of the sum of squares, calculated by dividing the sum of
squares by the degrees of freedom for each parameter.
• The F value column is the test statistic from the F test. This is the mean square of each
independent variable divided by the mean square of the residuals. The larger the F value, the
more likely it is that the variation caused by the independent variable is real and not due to
chance.
• The Pr(>F) column is the p value of the F statistic. This shows how likely it is that an F value
at least as large as the one calculated from the test would have occurred if the null hypothesis
of no difference among group means were true.
• The p value of the fertilizer variable is low (p < 0.001), so it appears that the type of fertilizer
used has a real impact on the final crop yield.
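To make the columns of the ANOVA table concrete, here is a minimal sketch that recomputes Df, Sum Sq, Mean Sq, F value, and Pr(>F) by hand and compares them with the aov() output. The data are simulated (three hypothetical fertilizer groups of 32 observations with made-up means), so the numbers will not match the crop.data results quoted above.

```r
set.seed(42)
# Simulated data: 3 hypothetical fertilizer groups, 32 observations each
fertilizer <- factor(rep(1:3, each = 32))
yield <- rnorm(96, mean = c(176.8, 176.9, 177.4)[fertilizer], sd = 0.6)

n <- length(yield)          # total number of observations
k <- nlevels(fertilizer)    # number of groups
grand.mean  <- mean(yield)
group.means <- as.vector(tapply(yield, fertilizer, mean))
group.sizes <- as.vector(table(fertilizer))

# Degrees of freedom
df.between <- k - 1
df.resid   <- n - 1 - df.between

# Sums of squares: between groups and within groups (residual)
ss.between <- sum(group.sizes * (group.means - grand.mean)^2)
ss.resid   <- sum((yield - ave(yield, fertilizer))^2)

# Mean squares, F statistic, and upper-tail p value
ms.between <- ss.between / df.between
ms.resid   <- ss.resid / df.resid
f.value    <- ms.between / ms.resid
p.value    <- pf(f.value, df.between, df.resid, lower.tail = FALSE)

# The same quantities as reported by aov()
tab <- summary(aov(yield ~ fertilizer))[[1]]
print(tab)
print(c(F = f.value, p = p.value))
```

The hand-computed F and p values should agree with the first row of the aov() table up to floating-point error.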
Two-way ANOVA
• In the two-way ANOVA example, we are modeling crop yield as a function of
type of fertilizer and planting density.
• First we use aov() to run the model, then we use summary() to print the
summary of the model.
• two.way <- aov(yield ~ fertilizer + density, data = crop.data)
• summary(two.way)
• Adding planting density to the model seems to have made the model better:
• It reduced the residual variance (the residual sum of squares went from 35.89 to 30.765), and both
planting density and fertilizer are statistically significant (p-values < 0.001).
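The comparison of residual variance between the two models can be sketched as follows. The data frame crop.sim and its effect sizes are hypothetical and simulated here, so the residual sums of squares will differ from the 35.89 and 30.765 reported for crop.data.

```r
set.seed(7)
# Simulated balanced design: 3 fertilizers x 2 densities x 16 replicates = 96 rows
crop.sim <- expand.grid(
  rep        = 1:16,
  fertilizer = factor(1:3),
  density    = factor(1:2)
)

# Hypothetical additive effects plus random noise (values chosen arbitrarily)
crop.sim$yield <- 176.5 +
  0.3 * as.integer(crop.sim$fertilizer) +
  0.5 * as.integer(crop.sim$density) +
  rnorm(nrow(crop.sim), sd = 0.6)

one.way <- aov(yield ~ fertilizer, data = crop.sim)
two.way <- aov(yield ~ fertilizer + density, data = crop.sim)

# deviance() on a fitted linear model returns its residual sum of squares
rss.one <- deviance(one.way)
rss.two <- deviance(two.way)
print(c(one.way = rss.one, two.way = rss.two))

# F test for whether adding density significantly reduces the residual variation
print(anova(one.way, two.way))
```

Adding a term to a nested model can never increase the residual sum of squares, so the drop alone is not evidence of a better model; the F test from anova() (or the p value for density in summary(two.way)) indicates whether the reduction is larger than chance would explain.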
Kruskal-Wallis test
• A Kruskal-Wallis test is used to determine whether or not there is a statistically significant
difference between the medians of three or more independent groups.
• This test is the nonparametric equivalent of the one-way ANOVA and is typically used when the
normality assumption is violated.
• The Kruskal-Wallis test does not assume normality in the data and is much less sensitive to
outliers than the one-way ANOVA.
• Here are a couple of examples of when you might conduct a Kruskal-Wallis test:
• Example 1: Comparing Study Techniques
• You randomly split up a class of 90 students into three groups of 30. Each group uses a different
studying technique for one month to prepare for an exam.
• At the end of the month, all of the students take the same exam. You want to know whether or
not the studying technique has an impact on exam scores.
• From previous studies you know that exam scores for these three studying
techniques are not normally distributed, so you conduct a Kruskal-Wallis test to determine if there
is a statistically significant difference between the median scores of the three groups.
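A minimal sketch of Example 1 in R, using the built-in kruskal.test() function. The exam scores are simulated from a skewed distribution, and the technique labels "A", "B", and "C" are placeholders invented for the example.

```r
set.seed(123)
# 90 students split into three hypothetical study techniques, 30 each
technique <- factor(rep(c("A", "B", "C"), each = 30))

# Simulated, left-skewed exam scores (made-up data, clipped at 0)
scores <- pmax(0, 100 - rexp(90, rate = c(1/15, 1/12, 1/10)[technique]))

# Kruskal-Wallis rank sum test: do the groups differ in location?
kw <- kruskal.test(scores ~ technique)
print(kw)

# Test statistic, degrees of freedom (number of groups minus 1), and p value
print(c(statistic = unname(kw$statistic),
        df        = unname(kw$parameter),
        p.value   = kw$p.value))
```

A p value below the chosen significance level (commonly 0.05) would suggest that at least one technique's scores differ from the others; the test itself does not say which groups differ.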