ABD Formulas
Central Limit Theorem X̄ will have an approximately normal distribution if n is large enough (typically n = 30 is enough for this).
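A quick simulation (not from the notes; the Exp(1) population and the counts are illustrative) shows the sampling distribution of the mean settling near normal:

```r
# Sketch: CLT in action. Means of n = 30 draws from a skewed Exp(1)
# population are approximately normal with mean 1 and sd 1/sqrt(n).
set.seed(1)
n <- 30
means <- replicate(5000, mean(rexp(n, rate = 1)))
c(mean_of_means = mean(means), sd_of_means = sd(means), theory_sd = 1 / sqrt(n))
```

A histogram of `means` would look close to a normal curve even though the population is strongly skewed.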
Proportion P̂ is the random variable of the sampling distribution of the proportion

    E(P̂) = p    where p is the proportion in the population

    Var(P̂) = p(1 − p)/n
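These two facts can be checked by simulation; a sketch with assumed values p = 0.3 and n = 50:

```r
# Sketch: checking E(P-hat) = p and Var(P-hat) = p(1-p)/n by simulation,
# with illustrative values p = 0.3 and n = 50.
set.seed(3)
p <- 0.3; n <- 50
phat <- rbinom(10000, size = n, prob = p) / n   # 10000 sample proportions
c(mean(phat), p)                 # E(P-hat) should be near p
c(var(phat), p * (1 - p) / n)    # Var(P-hat) should be near p(1-p)/n
```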
Likelihood For data x1, . . . , xn from a distribution with some unknown parameter θ:

    L(θ) = P(X1 = x1) × ··· × P(Xn = xn) = p(x1) ··· p(xn)
Likelihood Ratio If θ is the value from a hypothesis, and θ̂ is the maximum likelihood estimate, then:

    LR = L(θ) / L(θ̂)
Likelihood Ratio Test (LRT)

    Λ̂ = 2(ℓ1 − ℓ0) = 2 loge( L(θ1) / L(θ0) )

where θ1 is the parameter from the more complicated model and θ0 is from the simpler model.

    Λ̂ ~ χ²(df)    where df is the difference in the degrees of freedom for the two models.
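A minimal sketch of the LRT by hand, using made-up data and two nested lm() fits (x, z, y are illustrative names, not from the notes):

```r
# Sketch: the LRT for two nested linear models on simulated data.
set.seed(42)
x <- rnorm(50); z <- rnorm(50)
y <- 1 + 2 * x + rnorm(50)
m0 <- lm(y ~ x)       # simpler model (theta_0)
m1 <- lm(y ~ x + z)   # more complicated model (theta_1)
Lambda <- 2 * (as.numeric(logLik(m1)) - as.numeric(logLik(m0)))
df <- attr(logLik(m1), "df") - attr(logLik(m0), "df")  # difference in df
pchisq(Lambda, df = df, lower.tail = FALSE)            # p-value
```

Since z is unrelated to y here, the p-value should be unremarkable; the same pattern appears later in the assignment solutions via `anova(..., test = "LRT")`.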
Sample size Generally good to have a sample size of at least 30 for the Central Limit Theorem (more if the data are badly skewed), but you also might care about other things:
Confidence Intervals If given a margin of error (= distribution value × variability): use the table below (last page of these notes) to get a formula which includes an n, then solve for n.
1-sample t-test

    n ≥ (z(1−α/2) + z(1−β))² σ² / Δ²

Where α is the significance level, 1 − β is the power, σ² is the variance and Δ is the difference to be detected.
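A sketch plugging assumed values (α = 0.05, power 0.8, σ = 10, Δ = 5) into the formula above; `power.t.test()` in R would give a slightly larger n because it uses the t rather than the z distribution:

```r
# Sketch: 1-sample test sample size with assumed alpha, power, sigma, delta.
alpha <- 0.05; power <- 0.8; sigma <- 10; delta <- 5
n <- (qnorm(1 - alpha / 2) + qnorm(power))^2 * sigma^2 / delta^2
ceiling(n)   # round up to a whole number of subjects -> 32
```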
2-sample t-test Assuming equal variances:
    n ≥ 2 (z(1−α/2) + z(1−β))² σ² / Δ²

Where α is the significance level, 1 − β is the power, σ² is the variance and Δ is the difference to be detected. Note that this is the number for each group, so the total sample size is 2n.
1-proportion test

    n ≥ [ p0(1 − p0) z²(1−α/2) + p1(1 − p1) z²(1−β) ] / Δ²

Where α is the significance level, 1 − β is the power, Δ is the difference to be detected, p0 is the null hypothesis proportion and p1 is the value in the range (p0 − Δ, p0 + Δ) which is closest to 0.5.
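A sketch with assumed values (α = 0.05, power 0.8, p0 = 0.3, Δ = 0.1, so p1 = 0.4 is the value in (p0 − Δ, p0 + Δ) closest to 0.5):

```r
# Sketch: 1-proportion test sample size with assumed values.
alpha <- 0.05; power <- 0.8
p0 <- 0.3; delta <- 0.1; p1 <- 0.4
n <- (p0 * (1 - p0) * qnorm(1 - alpha / 2)^2 +
      p1 * (1 - p1) * qnorm(power)^2) / delta^2
ceiling(n)   # round up -> 98
```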
Generalised Linear Models All models have a deterministic part which gives an equation involving the predictors, and a stochastic part which incorporates the "randomness" in data (often called the "error distribution").
Linear Model
    yi = β0 + β1 x1i + β2 x2i + ··· + βk xki + εi,    εi ~ N(0, σ²)
Uses Normal error distribution (mean zero and constant variance). Often called “(Multiple)
Regression” when all predictors are numerical, “ANOVA” when all predictors are categorical,
and “ANCOVA” when there is a mixture.
Logistic Regression
    logit(πi) = β0 + β1 x1i + β2 x2i + ··· + βk xki,    yi ~ Bern(πi)

Bernoulli (or Binomial) model for randomness. Uses the logit link function to transform the linear equation into probabilities (restricted to between 0 and 1).

    logit(πi) = log( πi / (1 − πi) )    inverse logit: πi = e^T / (1 + e^T)  if T = logit(πi)
Poisson Regression
    log(λi) = β0 + β1 x1i + β2 x2i + ··· + βk xki,    yi ~ Pois(λi)

Poisson model for randomness. Uses the log link function to transform the linear equation into positive values.
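A sketch fitting a Poisson GLM to simulated data (the true coefficients β0 = 0.5 and β1 = 1 are assumptions for this demo, not from the notes):

```r
# Sketch: a Poisson GLM with log link recovers known coefficients.
set.seed(7)
x <- runif(100, 0, 2)
y <- rpois(100, lambda = exp(0.5 + 1 * x))   # true beta0 = 0.5, beta1 = 1
fit <- glm(y ~ x, family = poisson(link = "log"))
coef(fit)   # estimates should be close to (0.5, 1)
```

The same `glm()` call with `family = binomial(link = "logit")` gives the logistic regression above, as used in the aphid assignment later in these notes.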
Hypothesis testing: is the sample result consistent with the hypothesised value of the parameter?
1. What are the hypotheses? (state H0 and HA)
2. What sort of estimates would we expect if H0 is true? (assuming H0 is true, what does this tell us about the distribution of the sample estimates we could expect?)
3. How extreme is our sample estimate? Is our sample result close to what we expected if H0 is true? (calculate test statistic/p-value)

    test statistic = (estimate − H0 value) / variability

4. Decision? (reject or retain H0) Is our sample result too extreme/unusual? (compare sample results to critical value/significance level)
Parameter: estimate; distribution; standard error.

µ: estimate x̄; distribution z; sd(X̄) = σ/√n, when σ is known.

µD: estimate d̄; distribution t with np − 1 df; se(D̄) = sD/√np. Paired data: first calculate the differences (D), then same as above. np = number of pairs.

µ1 − µ2: estimate x̄1 − x̄2; distribution t with df below; se(X̄1 − X̄2) = spooled × √(1/n1 + 1/n2). Two independent samples, unpaired, assuming σ1 = σ2. spooled = √[((n1 − 1)s1² + (n2 − 1)s2²)/(n1 + n2 − 2)] and df = n1 + n2 − 2; or use spooled = √(MSerror) and df = dferror from ANOVA.

p: estimate p̂; distribution z; sd(P̂) = √(p(1 − p)/n) for hypothesis tests (use the H0 value of p for the standard error).

p: estimate p̂; distribution z; se(P̂) = √(p̂(1 − p̂)/n) for confidence intervals (no hypothesis value, so use the sample p̂).

log(OR): estimate log(ÔR); distribution z; se(log ÔR) = √(1/a + 1/b + 1/c + 1/d), using the cell values a, b, c, d to calculate the standard error. The interval needs to be back-transformed (e^lower, e^upper) after calculating it. Can also get this from a logistic regression (GLM with logit link). . . R will calculate the estimate and standard error.

β: estimate β̂; distribution t with df; se(β̂). Linear models (works for any parameter in the model). Use the matching se from R. df = n − #(parameters in model) = dferror.
Student's t distribution vs the normal distribution:

Normal N(0, 1) (z): used when the population standard deviation is known. At the 95% confidence level it has a distribution value of qnorm(0.975) = 1.96 (0.025 in each tail).

t: used when the population standard deviation is unknown and the sample standard deviation s is used instead. It has a distribution value based on the data, t_critical = qt(0.975, n − 1); note that it is not 1.96, and it varies based on the context.

We can standardise the value as (x̄ − µ)/SE, where x̄ is the sample value and SE is the standard error.

    CI: x̄ ± t_critical × SE
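The t-based interval above can be sketched in R with made-up data:

```r
# Sketch: 95% CI for a mean when sigma is unknown, so t replaces 1.96.
x <- c(4.2, 5.1, 3.8, 6.0, 4.9, 5.5)    # illustrative data
n <- length(x)
se <- sd(x) / sqrt(n)                   # standard error of the mean
tcrit <- qt(0.975, df = n - 1)          # ~2.57 here, not 1.96
mean(x) + c(-1, 1) * tcrit * se         # the 95% CI
qnorm(0.975)                            # the z value, 1.96
```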
Proportion: the number of successes follows a Bin(n, p) distribution; p̂ is the sample proportion and SE is its standard deviation. Commonly,

    CI: p̂ ± 1.96 × SE
Chi-squared distribution: used for tests based on the likelihood ratio and on squared distances between data and model.

Point estimation:

Likelihood: the probability of the data given the model. The model has a PDF/PMF p(x), and

    L(θ) = P(X1 = x1) × ··· × P(Xn = xn) = p(x1) ··· p(xn)

Least squares: the sum of squares (SS) measures the distance between the data and the model.

An estimator is a random variable with a distribution; an estimate is a number. Maximising the likelihood gives the best estimate of the population parameter (the maximum likelihood estimate).
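A sketch of maximising a likelihood numerically (the 0/1 data are illustrative; for a Bernoulli parameter the MLE is the sample proportion, so the two numbers should agree):

```r
# Sketch: numerically maximising a Bernoulli log-likelihood.
x <- c(1, 0, 1, 1, 0, 1, 1, 0, 1, 1)               # illustrative 0/1 data
loglik <- function(p) sum(dbinom(x, size = 1, prob = p, log = TRUE))
opt <- optimize(loglik, interval = c(0.001, 0.999), maximum = TRUE)
c(mle = opt$maximum, sample_prop = mean(x))        # both close to 0.7
```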
Choosing a test:

Numerical response, categorical predictor: ANOVA (one-way or two-way). Visualise with box plots; summarise numerically; F test. Assumes normally distributed errors, randomly assigned treatments and independent observations.

Numerical response, numerical predictor: linear regression.

Categorical response, categorical predictor: chi-square test. For independence between two categorical variables we compare proportions: observed values against expected values (or against any theoretical value) using a chi-square test.
Example (chi-square test): the expected frequency for a cell = (row total × column total)/grand total; each cell contributes (observed − expected)²/expected to the test statistic, which is compared with the critical value (3.84 for df = 1 at the 5% level).
Odds ratio and relative risk

    RR = P(A | g1) / P(A | g2)

where P(A | g) is the probability of obtaining feature A given that the group is g.

    OR = O(A | g1) / O(A | g2),    where O(A | g) = P(A | g) / (1 − P(A | g)) are the odds.

log(ÔR) follows an approximately normal distribution, with

    se(log ÔR) = √(1/a + 1/b + 1/c + 1/d)

where a, b, c, d are the counts in each cell of the contingency table. The 95% CI is

    log(ÔR) ± 1.96 × SE

and we can undo the log by using the exponential. log(RR) also follows an approximately normal distribution. If RR = 1 (equivalently, the interval for log OR contains 0), H0 is retained: no association between the 2 variables.
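A sketch of the odds-ratio interval with illustrative cell counts (`cc` is used instead of `c` only to avoid masking R's c() function):

```r
# Sketch: odds ratio and 95% CI from a 2x2 table with made-up counts.
#            feature yes   feature no
# group 1        a = 30       b = 70
# group 2       cc = 15       d = 85
a <- 30; b <- 70; cc <- 15; d <- 85
OR <- (a / b) / (cc / d)
se <- sqrt(1 / a + 1 / b + 1 / cc + 1 / d)
ci <- exp(log(OR) + c(-1, 1) * 1.96 * se)   # back-transform with exp
c(OR = OR, lower = ci[1], upper = ci[2])    # here the interval excludes 1
```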
One-sample t-test: we have a list of data and want to check the difference between the sample mean and a mean of interest, so we use a one-sample t-test.

Example (n = 6): H0: µ = 4, HA: µ > 4.

    x̄ = 4.8333, s = 2.385
    SE = s/√n = 2.385/√6 = 0.9736
    t = (x̄ − 4)/SE = 0.8333/0.9736 = 0.856, df = n − 1 = 5
    p = pt(0.856, df = 5, lower.tail = FALSE) ≈ 0.22

Since p > 0.05, retain H0.
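The same calculation in R (the values n = 6, x̄ = 4.8333 and s = 2.385 are taken from the notes):

```r
# Sketch replicating the hand calculation: H0 mu = 4 vs HA mu > 4.
n <- 6; xbar <- 4.8333; s <- 2.385
se <- s / sqrt(n)                        # ~0.974
tstat <- (xbar - 4) / se                 # ~0.856
p <- pt(tstat, df = n - 1, lower.tail = FALSE)
round(c(se = se, t = tstat, p = p), 3)   # p ~ 0.22, so retain H0 at 5%
```

With the raw data available, `t.test(x, mu = 4, alternative = "greater")` does all of this in one step.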
Paired t-test: groups A and B measured on the same units (yA, yB). In paired data, n is always the number of pairs. First calculate the differences, then

    SE = sD/√n,    t = d̄/SE

In the example from the notes, sD = 2.109 with n = 5 pairs, so SE = 2.109/√5 = 0.94. If |t| is large enough, reject the null hypothesis: the difference is significant.
Unpaired t-test:

    t = (x̄1 − x̄2)/SE,    SE = spool × √(1/n1 + 1/n2)
    spool = √[((n1 − 1)s1² + (n2 − 1)s2²)/(n1 + n2 − 2)],    df = n1 + n2 − 2

In the example from the notes, the one-tailed p-value was 0.3179, so the two-sided p-value is 2 × 0.3179 = 0.6358 > 0.05: retain H0.
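A sketch of the pooled computation, checked against t.test() (the two samples are made up for illustration):

```r
# Sketch: pooled two-sample t statistic and two-sided p-value by hand.
x1 <- c(5.1, 4.2, 6.3, 5.8, 4.9)
x2 <- c(3.9, 4.4, 3.1, 4.8, 3.6)
n1 <- length(x1); n2 <- length(x2)
sp <- sqrt(((n1 - 1) * var(x1) + (n2 - 1) * var(x2)) / (n1 + n2 - 2))
se <- sp * sqrt(1 / n1 + 1 / n2)
tstat <- (mean(x1) - mean(x2)) / se
p <- 2 * pt(abs(tstat), df = n1 + n2 - 2, lower.tail = FALSE)  # two-sided
c(t = tstat, p = p)   # same as t.test(x1, x2, var.equal = TRUE)
```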
2024-05-10
Instructions
• Assignment 3 contains 2 problems worth a total of 32 marks.
Pollinators dataset will be used to answer Problem 1. This dataset contains information on
two species of Sea Rocket, Cakile maritima (Maritima) and Cakile edentula (Edentula), and
their first generation hybrid plants (F1 Hybrid). Cakile sp. is an invasive weed in Australia
and we are interested in investigating the interactions between these species and their insect
pollinators. We collected data on the number of flowers plants produced over time, and we are
interested to see whether there is a correlation between the number of flowers produced by each
plant and the number of pollinators that visited the plants.
The file name for this dataset is "pollinators.csv". There is one row for each individual plant. Column Species
is the species of the plant; NumberofFlowers is the number of open flowers on each plant; NumberofPollinator
is the number of insect pollinators that visited that plant; Date is the day the data were recorded.
Download and save the data in your Working Directory.
Flower <- read.csv("pollinators.csv")
[Figure: boxplot of pollinator frequency (0 to 80) by plant species]
#scatter plot (1 mark)
plot(Flower$NumberofFlowers, Flower$NumberofPollinator, main = "Scatterplot")
[Figure: scatterplot of Flower$NumberofPollinator (0 to 80) vs Flower$NumberofFlowers (0 to 150)]
b. [1 mark] Fit a linear model to the response variable (Number of Pollinator) and predictor (Number
of Flowers). Add the fitted regression line from this linear model to the scatter plot you made in part a.
pol_lm <- lm(NumberofPollinator ~ NumberofFlowers, data = Flower)
plot(Flower$NumberofFlowers, Flower$NumberofPollinator, main = "Scatterplot")
abline(pol_lm, col = "red")
[Figure: the scatterplot from part a with the fitted regression line added in red]
c. [1 mark] Is there a strong correlation between the number of flowers each plant produced and the
number of pollinators that visited them? Why?
#Problem 1 _ Correlation and lm
cor(Flower$NumberofFlowers, Flower$NumberofPollinator)
## [1] 0.9137602
[ANSWER] Yes, there is a strong positive linear correlation between the number of flowers each plant produced
and the number of pollinators that visited them: the correlation coefficient is r = 0.914, which is close to 1
(equivalently, r² ≈ 0.83, so about 83% of the variation in pollinator numbers is explained by a linear
relationship with flower number).
d. [2 Marks] We are interested to see whether there is an effect of flower number and plant species on
the number of pollinators visited these plants. Write down the most appropriate statistical test to see
the effect of these two predictors on the number of pollinators. Clearly interpret the output of the
statistical test.
#Problem 1 _ ANCOVA
pollinator_lm <- lm(NumberofPollinator ~ Species + NumberofFlowers, data = Flower)
summary(pollinator_lm)
##
## Call:
## lm(formula = NumberofPollinator ~ Species + NumberofFlowers,
## data = Flower)
##
## Residuals:
## Min 1Q Median 3Q Max
## -34.574 -4.711 0.141 3.943 30.777
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.94327 0.99138 -3.978 9.26e-05 ***
## SpeciesF1 Hybrid 9.54214 1.38448 6.892 4.97e-11 ***
## SpeciesMaritima 8.48277 1.36546 6.212 2.34e-09 ***
## NumberofFlowers 0.58455 0.01661 35.187 < 2e-16 ***
## ---
## Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
##
## Residual standard error: 8.473 on 236 degrees of freedom
## Multiple R-squared: 0.867, Adjusted R-squared: 0.8653
## F-statistic: 512.8 on 3 and 236 DF, p-value: < 2.2e-16
anova(pollinator_lm)
e. [2 Marks] Is there any significant interaction between species and flower number? Fit a model to test
the interaction between the predictors. Clearly interpret the output of the model.
#Problem 1 _ interaction
model_interaction <- lm(NumberofPollinator ~ Species * NumberofFlowers, data = Flower)
summary(model_interaction)
##
## Call:
## lm(formula = NumberofPollinator ~ Species * NumberofFlowers,
## data = Flower)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.539 -2.721 -1.373 2.909 27.883
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.72059 0.91475 2.974 0.00325 **
## SpeciesF1 Hybrid -0.39984 1.49349 -0.268 0.78915
## SpeciesMaritima -0.38259 1.29686 -0.295 0.76824
## NumberofFlowers 0.20565 0.03177 6.474 5.56e-10 ***
## SpeciesF1 Hybrid:NumberofFlowers 0.46382 0.03991 11.623 < 2e-16 ***
## SpeciesMaritima:NumberofFlowers 0.44469 0.03599 12.355 < 2e-16 ***
## ---
## Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
##
## Residual standard error: 6.478 on 234 degrees of freedom
## Multiple R-squared: 0.9229, Adjusted R-squared: 0.9213
## F-statistic: 560.3 on 5 and 234 DF, p-value: < 2.2e-16
anova(model_interaction)
f. [2 Marks] Does your model (model in question 1e) meet the assumptions? Why?
#Problem 1 _ assumption check
plot(model_interaction, which = c(1,2))
[Figure: Residuals vs Fitted plot for lm(NumberofPollinator ~ Species * NumberofFlowers); points 211, 237, 213 flagged]
[Figure: Q-Q plot of standardized residuals for lm(NumberofPollinator ~ Species * NumberofFlowers); points 211, 237, 226 flagged]
[ANSWER] It seems that the model meets the assumption of normally distributed residuals. We can use the
residuals vs fitted values plot and the Q-Q residuals plot to check the distribution of errors in a model. The
residuals vs fitted values plot shows roughly equal error variances, and the residuals seem to be randomly
distributed. The Q-Q residuals plot shows the distribution of errors in the model; the error distribution looks
almost normal, although it is skewed in both tails. Note: for this question, we would accept either (yes or no)
answer, as long as you explain correct reasoning.
g. [3 Marks] Use the function anova() in R to compare your models (models with and without interaction)
in part d and e using a likelihood ratio test (LRT). State the null hypothesis for this test, the distribution
for the test statistic under the null hypothesis (specify also the degrees of freedom), and give a p-value.
# Problem 1 _ Model comparison
anova(pollinator_lm, model_interaction, test = "LRT")
The null hypothesis is that there is no improvement in fit from the additional
parameters (interaction terms) in Model 2 compared to Model 1. Given the very small p-value (< 2.2e-16),
we reject the null hypothesis. This means that adding the interaction term (Species * NumberofFlowers)
significantly improves the model fit. The interaction between Species and NumberofFlowers is important for
predicting the number of pollinators.
h. [4 Marks] We would like to know whether there is any significant difference between the number of
flowers each species produced or not. Run an appropriate statistical test to test if there is any significant
difference between the number of flowers each species (Maritima, Edentula and F1 Hybrid) produced.
Clearly interpret the output of the statistical test. Then, run a post hoc Tukey test to see which
species produced a significantly higher number of flowers. To run a Tukey post hoc test, use the TukeyHSD()
function to find confidence intervals for the difference between means for pairs of species, and use this
to see which pairs of species produced statistically significantly different numbers of flowers. Clearly interpret
the output of the post hoc test.
ANOVA_1 <- aov(NumberofFlowers ~ Species, data = Flower) #1mark
summary(ANOVA_1) #1 mark for interpretation of the summary
a. [3 Marks] Fit a linear model to test the difference between mean mortality of these aphid populations
when they were exposed to different concentrations of the pesticide. Check the residual (error)
distribution. Does your model provide a good fit to the data? Why or why not?
#lm with no transformation
Model1 <- lm( data= aphid, mort ~ conc + pop)
summary(Model1)
##
## Call:
## lm(formula = mort ~ conc + pop, data = aphid)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.44470 -0.21604 -0.04101 0.16218 0.57847
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.385419 0.044358 8.689 3.7e-15 ***
## conc 0.064452 0.006263 10.291 < 2e-16 ***
## popForbes5 0.052837 0.061136 0.864 0.389
## popKyabram98 0.049634 0.061136 0.812 0.418
## popOsborne171 -0.028344 0.061136 -0.464 0.644
## ---
## Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
##
## Residual standard error: 0.2802 on 163 degrees of freedom
## Multiple R-squared: 0.3994, Adjusted R-squared: 0.3847
## F-statistic: 27.1 on 4 and 163 DF, p-value: < 2.2e-16
plot(Model1, which = c(1,2))
[Figure: Residuals vs Fitted plot for lm(mort ~ conc + pop); points 158, 31, 34 flagged]
[Figure: Q-Q residuals plot for lm(mort ~ conc + pop); points 158, 31, 34 flagged]

[ANSWER] No, the model does not seem a good fit, as the Residuals vs Fitted plot and the Q-Q residuals plot
do not show a normal distribution of errors with constant variance.
b. [3 Marks] Fit another linear model and log transform the concentration of the pesticide in this model.
Check the residual (error) distribution. Does your model provide a good fit to the data? Why or why
not?
#lm with log transformation for pesticide concentration
Model2 <- lm( data = aphid, mort ~ I(log(conc)) + pop)
summary(Model2)
##
## Call:
## lm(formula = mort ~ I(log(conc)) + pop, data = aphid)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.70036 -0.07549 0.01228 0.09352 0.38801
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.807313 0.026846 30.072 <2e-16 ***
## I(log(conc)) 0.069398 0.002607 26.620 <2e-16 ***
## popForbes5 0.052837 0.033957 1.556 0.122
## popKyabram98 0.049634 0.033957 1.462 0.146
## popOsborne171 -0.028344 0.033957 -0.835 0.405
## ---
## Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
##
## Residual standard error: 0.1556 on 163 degrees of freedom
## Multiple R-squared: 0.8147, Adjusted R-squared: 0.8102
## F-statistic: 179.2 on 4 and 163 DF, p-value: < 2.2e-16
plot(Model2, which = c(1,2))
[Figure: Residuals vs Fitted plot for lm(mort ~ I(log(conc)) + pop); points 62, 26, 67 flagged]
[Figure: Q-Q residuals plot for lm(mort ~ I(log(conc)) + pop); points 62, 26, 67 flagged]
[ANSWER] Yes, the model appears to be a good fit, as both the Residuals vs Fitted plot and the Q-Q residuals
plot show that the residuals follow an approximately normal distribution after the logarithm transformation.
A logarithm transformation can stabilise variance and make the distribution of residuals more normal. This
transformation is particularly useful when dealing with skewed data or heteroscedasticity (non-constant
variance). This means the model is more likely to meet the assumptions of linear regression, leading to more
accurate predictions and valid statistical inferences.
c. [3 Marks] Fit a generalized linear model. What distribution family would you choose for this model?
Why? Check the residual (error) distribution. How has the distribution of the error changed compared
to the two previous models?
#glm
Model3 <- glm( data = aphid, mort ~ I(log(conc)) + pop, weights=total,
family = binomial(link = "logit"))
summary(Model3)
##
## Call:
## glm(formula = mort ~ I(log(conc)) + pop, family = binomial(link = "logit"),
## data = aphid, weights = total)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.71045 0.15503 11.033 <2e-16 ***
## I(log(conc)) 0.39314 0.01861 21.130 <2e-16 ***
## popForbes5 0.39088 0.18635 2.098 0.0359 *
## popKyabram98 0.35505 0.18656 1.903 0.0570 .
## popOsborne171 -0.17319 0.18355 -0.944 0.3454
## ---
## Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1020.8 on 167 degrees of freedom
## Residual deviance: 254.5 on 163 degrees of freedom
## AIC: 546.66
##
## Number of Fisher Scoring iterations: 4
plot(Model3, which = c(1,2))
[Figure: Pearson residuals vs predicted values for glm(mort ~ I(log(conc)) + pop); points 128, 3, 67 flagged]
[Figure: Q-Q plot of standardized deviance residuals for glm(mort ~ I(log(conc)) + pop); points 67, 26, 62 flagged]
[ANSWER] The distribution is binomial, as we recorded the death or survival of aphids after exposure to
different doses of pesticide.
d. [2 Marks] Which model was the most appropriate model to test whether there are differences between
mean mortality of the aphid populations? Explain why you chose this model. Explain and interpret
the summary of this model.
[ANSWER] Both models show similar normality in the Q-Q plots, indicating that the residuals are reasonably
normally distributed for both models. This suggests that both models meet the normality assumption well.
Therefore, both models are equally good. **If someone has chosen only one model, either of these reasonings is
acceptable: the GLM is a better model since the data are binomial; the LM is a better model as it is a
simpler model compared to the GLM.**
e. [4 Marks] Plot the scatter plot of response (Y axis) vs predictor (X axis), with groups (color and/or
different symbols may help here). Add the fitted regression lines from your model. Ensure your graph
is on the original scale and clearly labelled.
[ANSWER] We added the graph after log transformation too, so you can actually compare the graphs
before and after the transformation. Both graphs are acceptable. Note: you must color code the graph using
aphid populations because we want to see how populations’ mortality differs at each concentration level.
library(ggplot2)
[Figure: scatterplot of aphid mortality (%) vs pesticide concentration, coloured by population (Boggabilla209, Forbes5, Kyabram98, Osborne171), with fitted lines]

[Figure: the same plot against log(pesticide concentration), x-axis roughly -12 to 0]
Assignment 2_Solutions
TJ
2024-05-03
might have better dexterity. Also, higher levels of physical activity are associated with better overall health,
including muscle strength and neurological health, which can positively impact motor skills. Including these
variables in the study design can provide deeper insights into how dexterity varies not only between dominant
and non-dominant hands but also across different demographic segments. To incorporate age and physical
activity, we can group (blocking) the sample to include a balanced mix of ages and level of physical activity
every individual does. This approach allows us to control for the effects of age and physical activity or to
explore how these factors interact with hand dominance in affecting dexterity.
The following commands will reorganise the data in a way helpful to answering Problem 2.
# calculate the average nGrains across replicates for each user based on their "dominant hand" and "hand
dce2.agg <- aggregate(dce2, . ~ UserID+Hand+dominantHand, mean, na.action = na.omit)
# Select L and R trials and calculate differences
rTrials <- subset(dce2.agg, dce2.agg$Hand=="R")
rTrials <- rTrials[order(rTrials$UserID), ]
lTrials <- subset(dce2.agg, dce2.agg$Hand=="L")
lTrials <- lTrials[order(lTrials$UserID), ]
RL <- rTrials$nGrains-lTrials$nGrains # Calculates average difference (right-left)
dom.hand <- rTrials$dominantHand # Dominant hand of each person
# Creates a new data frame (dexterity.data) with the relevant variables calculated above
dexterity.data <- data.frame(user = rTrials$UserID, rHand = rTrials$nGrains,
lHand = lTrials$nGrains, difference = RL, dominantHand=dom.hand)
a. [3 marks] Use an appropriate plot to examine if the variable difference (calculated in the code above)
is normally distributed or not. Examine the difference separately for both right and left dominant hand
groups. (NOTE: You should make two plots and label them appropriately.)
ANSWER
# Subset
RDH <- dexterity.data[dexterity.data$dominantHand == "Right",]
LDH <- dexterity.data[dexterity.data$dominantHand == "Left",]
hist(RDH$difference)
[Figure: histogram of RDH$difference, values roughly 0 to 40, frequencies up to ~60]
hist(LDH$difference)
[Figure: histogram of LDH$difference, values roughly -12 to 2]
par(mfrow=c(1,2))
qqnorm(RDH$difference, main="QQ-plot of difference \n Right-dominant individuals")
qqline(RDH$difference, col="deeppink3")
qqnorm(LDH$difference, main="QQ-plot of difference \n Left-dominant individuals")
qqline(LDH$difference, col="green")
[Figure: side-by-side normal QQ-plots of difference for right-dominant and left-dominant individuals]
## [1] 1.936863e-13
# These two commands are doing exactly the same thing. So, either is fine to run a paired t-test!
t.test(RDH$rHand, RDH$lHand, paired = TRUE)
##
## Paired t-test
##
## data: RDH$rHand and RDH$lHand
## t = 8.5346, df = 97, p-value = 1.937e-13
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
## 3.351724 5.382970
## sample estimates:
## mean difference
## 4.367347
t.test(RDH$difference)
##
## One Sample t-test
##
## data: RDH$difference
## t = 8.5346, df = 97, p-value = 1.937e-13
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 3.351724 5.382970
## sample estimates:
## mean of x
## 4.367347
H0 : µdiff = 0
HA : µdiff ≠ 0
Null hypothesis: there is no difference in the mean number of grain between the right and left hands among
right-hand dominant individuals. Alternative hypothesis: there is a difference in the mean measurement
between the right and left hands among right-hand dominant individuals. There is strong evidence of a
difference in rice grain measurements of dexterity in Right-hand dominant individuals (*t(df=97) = 8.53, p=
1.937e-13*). On average, students moved between 3.35 and 5.38 more grains with their right hand.
c. [2 marks] Explain why it is reasonable to conduct a hypothesis test (t-test) for the mean difference for
Right-hand dominant individuals, even if they are not normal.
ANSWER The Central Limit Theorem (CLT) will apply since the sample size is large (n = 98). The mean of
the differences will be approximately normally distributed in this case.
d. [3 marks] Conduct a suitable hypothesis test for the students who are left-hand dominant. In
your analysis, carefully justify your choice of test, including a detailed examination of the assumptions
required for the selected test. After conducting the test, you need to write a clear conclusion that is
supported by relevant evidence. This conclusion should summarise the findings and explain how the
evidence supports these outcomes.
ANSWER As the data for left-hand dominant individuals is not normally distributed and the sample size is
small, we can use a non-parametric test to compare the difference.
#non-parametric test to compare mean difference
wilcox.test(LDH$difference)
##
## Wilcoxon signed rank exact test
##
## data: LDH$difference
## V = 1, p-value = 0.0009766
## alternative hypothesis: true location is not equal to 0
Given the non-normality in the small sample of left-hand dominant individuals, a non-parametric test was
considered appropriate. Left-hand dominant individuals showed strong evidence of differences in the grains
moved by each hand (Wilcoxon test V=1, P=0.00098), moving more grains with their left hand.
a. [5 marks] We are interested in whether there is an association between the number of languages spoken
and where people are from (metropolitan or regional). Treating Languages as a categorical variable,
carry out an appropriate hypothesis test. You will need to ensure that you meet the assumptions for
this test. In your answer:
• state your null hypothesis and your alternative hypothesis;
• list the assumptions and show they are met;
• state the test statistic, its distribution, and report the p-value
• describe in plain English what the results mean.
ANSWER
# extract relevant vectors
Location.Languages <- demographics[,c(3,7)]
# Expected values
#Metropolitan
TotCol*(TotRow[1])/GrandTot
## 1 2 3 4
## 72.4692737 40.9608939 26.7821229 0.7877095
#Regional
TotCol*(TotRow[2])/GrandTot
## 1 2 3 4
## 19.5307263 11.0391061 7.2178771 0.2122905
## OR you can use the shortcut: this will warn us of the issue.
chisq.test(tab1)
## Pearson's Chi-squared test
##
## data: tab1
## X-squared = 1.7909, df = 3, p-value = 0.6169
This warns us because we have failed the assumption for the χ² test. Essentially, we need
enough data, so we want 80% of expected values > 5 and all expected values > 1. So, we'll change to "3 or
more languages", which should fix it.
Location.Languages.Modified <- Location.Languages
Location.Languages.Modified["Languages"][Location.Languages.Modified["Languages"] >= 3] <- 3
# Find totals
TotRow2 <- rowSums(tab2)
TotCol2 <- colSums(tab2)
GrandTot2 <- sum(tab2)
## 1 2 3
## 72.46927 40.96089 27.56983
#Regional
TotCol2*(TotRow2[2])/GrandTot2
## 1 2 3
## 19.530726 11.039106 7.430168
## Now just do the test:
chisq.test(tab2)
##
## Pearson's Chi-squared test
##
## data: tab2
## X-squared = 1.61, df = 2, p-value = 0.4471
H0 : no association between where students are from and number of languages spoken
H1 : there is an association between where students are from and number of languages spoken
The only assumption is that there is sufficient data, which we have already detailed above and shown it is
now satisfied by the second test. Based on this sample, no association was found between where students are
from (metropolitan/regional) and the number of languages spoken (χ²(df = 2) = 1.61, p = 0.447).
b. [4 marks] We are interested in whether there is a difference between domestic and international
students' predicted final mark. Run an analysis of variance to test whether there is a significant difference
between the mean predicted final mark for international and domestic students. Use the PredictFinal
and Student columns in the DCE12024 data set to run this analysis. Include your code and the
outputs.
(i) What were the hypotheses being tested here?
(ii) Explain in detail what these results indicate.
ANSWER Can do this with either a t-test or a linear model, depending on whether the assumptions are met
(a linear model requires equal variances; this is optional for a t-test).
#Using a t-test
##
## Two Sample t-test
##
## data: PredictFinal by Student
## t = -5.3106, df = 177, p-value = 3.248e-07
## alternative hypothesis: true difference in means between group Domestic and group International is no
## 95 percent confidence interval:
## -8.991963 -4.119627
## sample estimates:
## mean in group Domestic mean in group International
## 72.79508 79.35088
#Using a linear model
demo.model <- lm(PredictFinal~Student, data=demographics)
par(mfrow=c(1,2))
plot(demo.model,which=1)
plot(demo.model,which=2)
[Figure: Residuals vs Fitted and Q-Q residuals plots for demo.model; points 110, 95, 67 flagged]
##
## Call:
## lm(formula = PredictFinal ~ Student, data = demographics)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.7951 -4.3509 0.6491 5.6491 19.6491
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 72.7951 0.6966 104.499 < 2e-16 ***
## StudentInternational 6.5558 1.2345 5.311 3.25e-07 ***
## ---
## Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
##
## Residual standard error: 7.694 on 177 degrees of freedom
## Multiple R-squared: 0.1374, Adjusted R-squared: 0.1326
## F-statistic: 28.2 on 1 and 177 DF, p-value: 3.248e-07
#You can also get an anova table for the model
anova(demo.model)
MAST20031 Analysis of Biological Data - Assignment 1
Solutions
Tara and Paul
Problem 0 [1 mark]
Participate in the data collection exercise. This was automatically recorded.
d <- read.csv("DCE2_2024.csv")
# NOTE: There are many other ways to do this, e.g. using na.omit()
# The summaries would also be correct even if the NA values were left in,
# but there would also be an "NA" column.
left <- d[d$dominantHand=="Left" & !is.na(d$dominantHand), ]
right <- d[d$dominantHand=="Right" & !is.na(d$dominantHand), ]
summary(right[right$Hand=="R","nGrains"])
b. [3 marks] For Left-dominant individuals, produce and report a paired box plot
(two boxplots within a single graph) for G vs Hand. Also make one of these paired
plots for Right-dominant individuals. Label the graphs appropriately.
par(mfrow=c(2,1))
boxplot(left$nGrains~left$Hand,
        ylab="Hand used", xlab="Number of Grains",
        main="Grains of rice moved by left-handed individuals",
        ylim=c(0,50), horizontal=TRUE)
boxplot(right$nGrains~right$Hand,
        ylab="Hand used", xlab="Number of Grains",
        main="Grains of rice moved by right-handed individuals",
        ylim=c(0,50), horizontal=TRUE)
c. [4 marks] Based on (a) and (b), briefly comment on differences or similarities
in terms of centre and spread of the distributions of number of grains for the
dominant hand compared to the non-dominant hand. Provide a biological
justification for your findings.
With some skew and some outliers, the median/IQR are better measures of centre/spread.
Even if a few groups seem symmetric, it doesn’t make sense to compare a standard
deviation to an IQR, for example.
ANSWER
For left-handed people, it seems like the left hand is slightly better at manipulating the rice
grains (Median: L=25, R=22). Similarly, for the right-handed people, the dominant hand has
a higher typical value (Median: L=21, R=24).
There is a wider spread of data in the left-hand trials than the right-hand trials for the left-handed
people (IQR: L=8.5, R=6). Right-handed people also showed more variability in
their dominant hand (IQR: L=6, R=9). Overall, the variability is higher in right-handed
people than in left-handed people.
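One way to obtain group medians and IQRs like those quoted above is with tapply (a sketch only; the reported values may have been computed differently):

```r
# Median and IQR of grains moved, by hand used, within each dominance group
tapply(left$nGrains, left$Hand, median)
tapply(left$nGrains, left$Hand, IQR)
tapply(right$nGrains, right$Hand, median)
tapply(right$nGrains, right$Hand, IQR)
```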
Possible Biological explanations:
• People use their dominant hand more frequently, and this may lead to greater
dexterity (consistent with slight difference between the dominant and non-
dominant hands in both groups)
……….
d. [2 marks] Potential outliers are observations that fall outside some pre-determined
"fences", typically 1.5 × IQR beyond Q1 and Q3. The number of
outliers in these data is given below. Give a possible biological explanation for
the pattern in the observed outliers.
                            Number of low outliers   Number of high outliers
Left hand dominant   Left             0                         1
                     Right            0                         0
Right hand dominant  Left             2                        11
                     Right            0                         8
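Counts like these can be produced with a small helper built directly from the fence definition above (the function name countOutliers is ours, for illustration only):

```r
# Count observations outside the 1.5*IQR fences beyond Q1 and Q3
countOutliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm=TRUE)
  fence <- 1.5 * (q[2] - q[1])
  c(low  = sum(x < q[1] - fence, na.rm=TRUE),
    high = sum(x > q[2] + fence, na.rm=TRUE))
}
# e.g. right-hand-dominant individuals using their left hand
countOutliers(right[right$Hand=="L", "nGrains"])
```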
ANSWER
Most outliers occurred in the right-handed group, and most were unusually high values. A
possible biological explanation could be that some right-handed people are equally
dexterous with both hands (e.g. because they routinely perform/practise tasks which require
both hands, such as playing the piano). Moreover, some right-handed people might be
extremely bad at using their non-dominant (left) hand, so we also observed some very low
outliers in the right-handed group when they used their non-dominant hand.
……….
e. [3 marks] Use the qqnorm and qqline commands to produce normal QQ-plots
for 𝑮, for each combination of hand and dominant hand (a total of four plots).
Does the normal distribution seem a good statistical model for these data? Why
or why not?
ANSWER
# Step 1. Name the subsets
LeftyL <- left[left$Hand=="L","nGrains"]
LeftyR <- left[left$Hand=="R","nGrains"]
RightyL <- right[right$Hand=="L","nGrains"]
RightyR <- right[right$Hand=="R","nGrains"]
# Step 2. Draw the four QQ-plots in a 2x2 grid
par(mfrow=c(2,2))
qqnorm(LeftyL, main="Left-dominant, left hand"); qqline(LeftyL)
qqnorm(LeftyR, main="Left-dominant, right hand"); qqline(LeftyR)
qqnorm(RightyL, main="Right-dominant, left hand"); qqline(RightyL)
qqnorm(RightyR, main="Right-dominant, right hand"); qqline(RightyR)
Problem 2 [6 marks]
a. [3 marks] Consider just the left-handed individuals. We want to estimate the
population standard deviation in the number of grains (overall, ignoring which
hand was being used). Write some code to calculate a 95% bootstrap confidence
interval for the population standard deviation σ. Add comments to explain what
your code is doing; you will be assessed on both your code and your comments
NOTE: you must write some code, not use other functions and/or packages to do this.
ANSWER
original.data <- left$nGrains
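The bootstrap loop producing the interval below could look like the following minimal sketch (B = 10000 resamples is our assumption, and the resampling is random, so results will vary slightly from run to run):

```r
n <- length(original.data)
B <- 10000                 # number of bootstrap resamples (illustrative choice)
boot.sd <- numeric(B)
for (b in 1:B) {
  # Resample the data with replacement and record the resample's SD
  boot.sd[b] <- sd(sample(original.data, n, replace=TRUE))
}
# The 95% bootstrap CI is the 2.5% and 97.5% quantiles of the boot.sd values
quantile(boot.sd, c(0.025, 0.975))
```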
##    2.5%  97.5%
## 5.12975 7.8205
……….
ANSWER
It makes sense to calculate a bCI for the median and for the standard deviation, because these
statistics do not typically follow normal distributions. You could calculate one for the mean too, as
bootstrap intervals are always an option. However, with a large sample size (as in
this problem, with n = 80), even if the original data are not normal, the central
limit theorem (CLT) means that a standard CI for the mean μ will be appropriate, and hence
better than a bCI.
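To illustrate that last point, a standard t-based CI for the mean is a one-liner (a sketch, assuming the left-handed data are still in left$nGrains as above):

```r
x <- left$nGrains
# 95% t-based CI for the mean; with n = 80 the CLT makes this reasonable
mean(x) + c(-1, 1) * qt(0.975, length(x) - 1) * sd(x) / sqrt(length(x))
```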
……….
Problem 3 [6 marks]
In the final problem we will examine whether there is a learning effect in our data - do
students get better at this task with each trial?
a. [1 mark] Plot four histograms: the number of grains across all trials, and
separately for each trial. Use the arguments xlim=c(0, 50), breaks=25 for all
plots.
ANSWER
First create the subsets
TrialAll <- d$nGrains
Trial1 <- d$nGrains[d$Replicate==1]
Trial2 <- d$nGrains[d$Replicate==2]
Trial3 <- d$nGrains[d$Replicate==3]
hist(TrialAll,
xlim=c(0, 50), breaks=25,
main="Histogram for all trials", xlab="Number of Grains")
hist(Trial1,
xlim=c(0, 50), breaks=25,
main="Histogram for Trial 1", xlab="Number of Grains")
hist(Trial2,
xlim=c(0, 50), breaks=25,
main="Histogram for Trial 2", xlab="Number of Grains")
hist(Trial3,
xlim=c(0, 50), breaks=25,
main="Histogram for Trial 3", xlab="Number of Grains")
……….
b. [2 marks] Compute the sample mean, x̄, and sample standard deviation, s,
for the number of grains in the entire dataset (702 rows) and for the first,
second and third trials separately. Briefly compare and contrast the findings from
the original and separated data.
ANSWER
          Mean     SD
Overall   23.802   6.844
Trial 1   22.674   6.607
Trial 2   23.831   6.738
Trial 3   24.901   7.027
The distribution of the data is quite similar in all cases, with only slight differences in the
means and standard deviations.
The mean increases by approx. 1 grain on each trial (T1=22.7, T2=23.8, T3=24.9). However,
the standard deviation also increases very slightly with the trial.
To determine whether there is a learning effect, it would be necessary to evaluate whether
these differences between trials are statistically significant.
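One possible way to check this (not required by the assignment) would be a one-way ANOVA of grains on trial number. This is only a sketch: the trials are repeated measures on the same students, so a paired or repeated-measures analysis would be more appropriate than this version, which ignores the pairing:

```r
# Does mean nGrains differ across the three trials? (ignores pairing)
summary(aov(nGrains ~ factor(Replicate), data=d))
```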
……….
c. [3 marks] Compute the 95% confidence interval across the entire dataset and
for each trial separately. Does it seem like students are becoming better at this
task with each trial? Why or why not?
ANSWER
Although there is some evidence of positive skew, because the sample size is large the
central limit theorem applies and we can assume approximate normality when calculating these
intervals. However, since we do not know the population standard deviation, it is best to use
the t distribution:
CI = x̄ ± t × SE
# All trials
SE_All = sd(TrialAll)/sqrt(length(TrialAll))
tCrit_All = c(-1,1) * qt(0.975, length(TrialAll)-1)
mean(TrialAll) + tCrit_All * SE_All
# Trial 1
SE_T1 = sd(Trial1)/sqrt(length(Trial1))
tCrit_T1 = c(-1,1) * qt(0.975, length(Trial1)-1)
mean(Trial1) + tCrit_T1 * SE_T1
# Trial 2
SE_T2 = sd(Trial2)/sqrt(length(Trial2))
tCrit_T2 = c(-1,1) * qt(0.975, length(Trial2)-1)
mean(Trial2) + tCrit_T2 * SE_T2
# Trial 3
SE_T3 = sd(Trial3)/sqrt(length(Trial3))
tCrit_T3 = c(-1,1) * qt(0.975, length(Trial3)-1)
mean(Trial3) + tCrit_T3 * SE_T3
At first glance it seems like students are getting better, since the intervals gradually move
towards larger numbers, suggesting that more grains were handled during the same
amount of time. However, all of these intervals overlap, so we cannot be sure that the true
population means for each trial are different. This analysis does not provide enough
evidence to support the claim that there is a learning effect (that students are getting better
at this task with each trial).
……….