ABD Formulas

Distributions

All distributions: for any distribution (discrete or continuous) we can write:

E(X) = µ_X
Var(X) = σ²_X

Discrete distributions
Probability mass function (pmf): p(x) = P(X = x)
Cumulative distribution function (cdf): F(x) = P(X ≤ x)
E(X) = Σ_{x∈Ω} x·p(x)
Var(X) = E(X²) − [E(X)]²

Binomial: X ~ Binom(n, p)
p(x) = (n choose x) p^x (1−p)^(n−x)
E(X) = np
Var(X) = np(1−p)

Poisson: X ~ Pois(λ)
p(x) = λ^x e^(−λ) / x!    (note: 0! = 1)
E(X) = λ
Var(X) = λ

Continuous distributions
Probability density function (pdf): f(x)
Cumulative distribution function (cdf): F(x) = P(X ≤ x)
Normal: X ~ N(µ, σ²); standard normal: Z ~ N(0, 1)
Sampling Distributions

For any distribution (any shape) with mean µ_X and standard deviation σ_X:

Mean: the mean x̄ of a sample of size n from the population will vary from sample to sample.
X̄ is the random variable of the sampling distribution of the mean:
E(X̄) = µ_X
Var(X̄) = σ²_X / n,  so  sd(X̄) = σ_X / √n

Central Limit Theorem: X̄ will have an approximately normal distribution if n is large enough (typically 30 is enough for this).

Proportion: P̂ is the random variable of the sampling distribution of the proportion:
E(P̂) = p, where p is the proportion in the population
Var(P̂) = p(1−p) / n
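As a quick check of these results, a minimal R sketch (the population here is a made-up skewed one, exponential with mean 2) simulating the sampling distribution of the mean:

set.seed(1)
# Skewed population: exponential with mean 2 (sd also 2)
xbars <- replicate(10000, mean(rexp(30, rate = 0.5)))

mean(xbars)      # close to the population mean, 2
sd(xbars)        # close to sigma/sqrt(n) = 2/sqrt(30) = 0.365
hist(xbars)      # approximately normal, as the CLT predicts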
Likelihood

For data x₁, ..., xₙ from a distribution with some unknown parameter θ:

The likelihood: L(θ) = P(data | θ) = P(x₁ | θ) × P(x₂ | θ) × ··· × P(xₙ | θ)
The log-likelihood: ℓ(θ) = log_e(L(θ))

Likelihood Ratio: if θ is the value from a hypothesis, and θ̂ is the maximum likelihood estimate, then:
LR = L(θ) / L(θ̂)

Likelihood Ratio Test (LRT): Λ = 2(ℓ₁ − ℓ₀) = 2 log_e( L(θ₁) / L(θ₀) )
where θ₁ is the parameter from the more complicated model and θ₀ is from the simpler model.
Λ ~ χ²_df, where df is the difference in the degrees of freedom for the two models.
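To make these definitions concrete, a minimal R sketch (the counts and the hypothesised value are made up) computing the log-likelihood and LRT for a Poisson mean:

x <- c(2, 4, 1, 3, 5, 2)                     # made-up Poisson counts

loglik <- function(lambda) sum(dpois(x, lambda, log = TRUE))

lambda0 <- 2          # hypothesised value (simpler model)
lambda1 <- mean(x)    # maximum likelihood estimate of a Poisson mean

exp(loglik(lambda0) - loglik(lambda1))              # LR = L(theta)/L(theta-hat)
Lambda <- 2 * (loglik(lambda1) - loglik(lambda0))   # LRT statistic
pchisq(Lambda, df = 1, lower.tail = FALSE)          # p-value; df = 1 parameter difference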
Sample size

Generally good to have sample size at least 30 for the Central Limit Theorem (more if data are badly skewed), but you also might care about other things:

Confidence intervals: if given a margin of error (= distribution value × variability), use the table below (last page of these notes) to get a formula which includes an n... then solve for n.

1-sample t-test:
n ≥ (z_{1−α/2} + z_{1−β})² σ² / δ²
where α is the significance level, 1−β is the power, σ² is the variance and δ is the difference to be detected.

2-sample t-test (assuming equal variances):
n ≥ 2 (z_{1−α/2} + z_{1−β})² σ² / δ²
where α is the significance level, 1−β is the power, σ² is the variance and δ is the difference to be detected. Note that this is the number for each group, so the total sample size is 2n.

1-proportion test:
n ≥ [ p₀(1−p₀)(z_{1−α/2})² + p₁(1−p₁)(z_{1−β})² ] / δ²
where α is the significance level, 1−β is the power, δ is the difference to be detected, p₀ is the null hypothesis proportion and p₁ is the value in the range (p₀ − δ, p₀ + δ) which is closest to 0.5.
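A minimal sketch of the 1-sample formula in R (the σ and δ values are assumptions for illustration), with power.t.test() as a cross-check:

alpha <- 0.05; power <- 0.80
sigma <- 10; delta <- 5      # assumed sd and difference to be detected

# 1-sample formula above (z approximation)
n <- (qnorm(1 - alpha/2) + qnorm(power))^2 * sigma^2 / delta^2
ceiling(n)                   # round up: about 32

# R's built-in version uses the t distribution, so it is slightly larger
power.t.test(delta = delta, sd = sigma, sig.level = alpha,
             power = power, type = "one.sample")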
Generalised Linear Models

All models have a deterministic part which gives an equation involving the predictors, and a stochastic part which incorporates the "randomness" in data (often called the "error distribution").

Linear Model:
y_i = β₀ + β₁x_{1i} + β₂x_{2i} + ··· + β_k x_{ki} + ε_i,   ε_i ~ N(0, σ²)
Uses a Normal error distribution (mean zero and constant variance). Often called "(Multiple) Regression" when all predictors are numerical, "ANOVA" when all predictors are categorical, and "ANCOVA" when there is a mixture.

Logistic Regression:
logit(π_i) = β₀ + β₁x_{1i} + β₂x_{2i} + ··· + β_k x_{ki},   y_i ~ Bern(π_i)
Bernoulli (or Binomial) model for randomness. Uses the logit link function to transform the linear equation into probabilities (restricted to between 0 and 1):
logit(π_i) = log( π_i / (1 − π_i) );   inverse logit: π_i = e^T / (1 + e^T) if T = logit(π_i)

Poisson Regression:
log(λ_i) = β₀ + β₁x_{1i} + β₂x_{2i} + ··· + β_k x_{ki},   y_i ~ Pois(λ_i)
Poisson model for randomness. Uses the log link function to transform the linear equation into positive values:
log(λ_i) = T;   inverse log: λ_i = e^T
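A minimal R sketch of the two link functions and their inverses (values chosen arbitrarily for illustration):

logit     <- function(p) log(p / (1 - p))
inv_logit <- function(t) exp(t) / (1 + exp(t))   # same as plogis(t)

inv_logit(logit(0.25))     # recovers 0.25
inv_logit(c(-5, 0, 5))     # always between 0 and 1

exp(log(3.7))              # log link: exp() back-transforms to positive values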


Inference
Confidence Intervals: a range of plausible values for a population parameter

CI(parameter) = estimate ± distribution value × variability

Hypothesis testing: is the sample result consistent with the hypothesised value of the parameter?

Thinking process for conducting hypothesis tests:

1. What is the research question? (define variables/parameters; state hypotheses)

2. What sort of estimates would we expect if H0 is true? (assuming H0 is true, what does this tell us about the distribution
of the sample estimates we could expect?)

3. How extreme is our sample estimate? Is our sample result close to what we expected if H0 is true? (calculate test
statistic/p-value)

test statistic = (estimate − H0 value) / variability

4. Decision? (reject or retain H0 ) Is our sample result too extreme/unusual? (compare sample results to critical
value/significance level)

5. Conclusion? (in the context of the research question)


parameter | estimate | distribution value | variability | notes

µ | x̄ | z | sd(X̄) = σ/√n | when σ is known.

µ | x̄ | t_{n−1} | se(X̄) = s/√n | when σ is unknown.

µ_D | d̄ | t_{n_p−1} | se(D̄) = s_D/√n_p | paired data. First calculate differences (D) then same as above. n_p = number of pairs.

µ₁ − µ₂ | x̄₁ − x̄₂ | t_df | se(X̄₁ − X̄₂) = s_pooled × √(1/n₁ + 1/n₂) | 2 independent samples, unpaired, assuming σ₁ = σ₂. s_pooled = √[ ((n₁−1)s₁² + (n₂−1)s₂²) / (n₁+n₂−2) ] and df = n₁ + n₂ − 2; or use s_pooled = √MS_error and df = df_error from ANOVA.

p | p̂ | z | sd(P̂) = √( p(1−p)/n ) | for hypothesis tests (use the H₀ p for the standard error).

p | p̂ | z | se(P̂) = √( p̂(1−p̂)/n ) | for confidence intervals (no hypothesis value, so use the sample p̂).

log(OR) | log(ÔR) | z | se(log ÔR) = √( 1/a + 1/b + 1/c + 1/d ) | using the cell values a, b, c, d to calculate the standard error. The interval needs to be back-transformed (e^lower, e^upper) after calculating the interval. Can also get this from a logistic regression (GLM with logit link)... R will calculate the estimate and standard error.

β | β̂ | t_df | se(β̂) | linear models (works for any parameter in the model). Use the matching se from R. df = n − #(parameters in model) = df_error.

β | β̂ | z | se(β̂) | generalised linear models (except for the Normal error distribution). Works for any parameter in the model; use the appropriate se from R.
Distribution values: t vs z

Normal, Z ~ N(0, 1): used when the population standard deviation is known. At the 95% confidence level it has a distribution value of 1.96.

Student's t: used when the population standard deviation is unknown and the sample standard deviation s is used instead, especially when the sample size is small (< 30). It has a distribution value based on the data, called t critical — note that it is not 1.96.

t critical: qt(0.975, n − 1)
z critical: qnorm(0.975) = 1.96 (0.025 in each tail)

We can standardise a value by t = (x̄ − µ) / SE, where x̄ is the sample value and SE is the standard error (µ will vary based on the context).

CI = x̄ ± t_critical × SE
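A minimal R sketch of the two critical values and a t-based CI (the sample values are made up):

n <- 10
qnorm(0.975)           # z critical: 1.96 (sigma known)
qt(0.975, df = n - 1)  # t critical: 2.26 for n = 10, based on the data

x <- c(23, 25, 21, 30, 26, 24, 28, 22, 27, 25)   # made-up sample
SE <- sd(x) / sqrt(n)                            # standard error
mean(x) + c(-1, 1) * qt(0.975, df = n - 1) * SE  # 95% CI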

Proportion distribution

The sample proportion P̂ comes from a Binomial(n, p) distribution. This is only used when we can assume the sampling distribution is approximately normal: P̂ will be distributed normally if the sample size is large enough (commonly greater than 30).

CI = p̂ ± 1.96 × SE,  where SE = √( p̂(1−p̂)/n )
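A minimal sketch in R (the p̂ and n values are assumptions for illustration):

phat <- 0.4; n <- 100                 # assumed sample proportion and size
SE <- sqrt(phat * (1 - phat) / n)     # standard error of the proportion
phat + c(-1, 1) * 1.96 * SE           # 95% CI; valid because n > 30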
Point estimation

The model has a pdf/pmf, and we estimate an unknown parameter a of the model from the data x₁, ..., xₙ in one of two ways:

Sum of squares (SS) — a distance between the data and the model:
SS = Σᵢ (xᵢ − a)²
Minimising the SS gives the best (least-squares) estimate.

Likelihood — the probability of the data given the model:
L(a) = P(X | a) = P(x₁ | a) × P(x₂ | a) × ··· × P(xₙ | a)
Maximising the likelihood gives the best (maximum likelihood) estimate.
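A minimal R sketch contrasting the two approaches (made-up data; a Normal model with sd fixed at 1 for simplicity) — both recover the sample mean here:

x <- c(4, 3, 2, 7, 5, 8)   # made-up data

ss     <- function(a) sum((x - a)^2)                       # distance to the data
loglik <- function(a) sum(dnorm(x, mean = a, log = TRUE))  # Normal model, sd = 1

optimize(ss, interval = range(x))$minimum                      # minimise SS
optimize(loglik, interval = range(x), maximum = TRUE)$maximum  # maximise likelihood
mean(x)   # both estimates agree with the sample mean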
Test statistics: choosing a test

We don't know everything about the population, so we estimate it by choosing a sample. The sample we choose needs to be representative of the population. Start by visualising (e.g. box plots) and summarising the data, and check the assumptions: normally distributed data, randomly assigned treatments, independent variables.

• numerical + categorical (2 levels): paired or unpaired t-test
• numerical + categorical (> 2 levels): ANOVA / two-way ANOVA (F test)
• numerical + numerical: linear regression
• categorical + categorical: chi-square test — independence between categorical variables. We compare proportions of observed values against expected values (or against any theoretical value) using a chi-square test.

Example: observed counts of 904, 2023 and 912 (total 3839). The expected frequency for each cell is its estimated proportion × total (here the estimated proportions are about 0.499 and 0.501, giving expected counts of approximately 955.53, 1930.60 and 988.52).

Each cell contributes (observed − expected)² / expected:

χ² = (904 − 955.53)²/955.53 + (2023 − 1930.60)²/1930.60 + (912 − 988.52)²/988.52 ≈ 13.18

With df = (rows − 1) × (columns − 1) = 1, the 5% critical value is 3.84. Since 13.18 > 3.84, the result is significant.

Note: we can only use χ² to test significance; we can't tell which group is more likely than another.
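The same kind of calculation can be done with chisq.test() in R; a minimal sketch on a hypothetical 2×2 table (counts made up for illustration):

# Hypothetical 2x2 contingency table
tab <- matrix(c(35, 65,
                20, 80), nrow = 2, byrow = TRUE,
              dimnames = list(group = c("A", "B"), feature = c("yes", "no")))

chisq.test(tab, correct = FALSE)$expected  # expected counts under independence
chisq.test(tab, correct = FALSE)           # X-squared, df = 1, p-value
qchisq(0.95, df = 1)                       # critical value: 3.84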

Odds ratio and relative risk

χ² only tells us about significance, so to compare groups we use the relative risk or odds ratio. For a feature A and two groups g₁ and g₂:

RR(A) = P(A | g₁) / P(A | g₂), where P(A | g₁) is the probability of obtaining feature A given that the group is g₁.

OR(A) = O(A | g₁) / O(A | g₂), where the odds O(A) = P(A) / (1 − P(A)).

OR(A) = 1: the groups are equally likely to have A; OR(A) > 1: g₁ is more likely; OR(A) < 1: g₂ is more likely. For example, RR = P(red hair | female) / P(red hair | male) ≈ 1.0076 means females are about 1.008 times as likely as males to have red hair — essentially equally likely.

If n is large, log(RR) and log(OR) follow a normal distribution, so for a CI:
se(log ÔR) = √( 1/a + 1/b + 1/c + 1/d ), where a, b, c, d are the counts in each cell of the contingency table.
CI: log(ÔR) ± 1.96 × se, then back-transform the endpoints using the exponential.
If the interval for RR (or OR) includes 1, H₀ is retained: no association between the 2 variables.
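A minimal R sketch on a hypothetical 2×2 table (counts made up; the names cc and dd are used to avoid masking R's c()):

#                     feature   no feature
# group 1:               a=30         b=70
# group 2:              cc=20        dd=80
a <- 30; b <- 70; cc <- 20; dd <- 80

(a / (a + b)) / (cc / (cc + dd))      # relative risk
OR <- (a / b) / (cc / dd)             # odds ratio
SE <- sqrt(1/a + 1/b + 1/cc + 1/dd)   # se of log(OR)
exp(log(OR) + c(-1, 1) * 1.96 * SE)   # 95% CI for OR, back-transformed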
One-sample t-test

We have a list of data and want to check the difference between the sample mean and a mean of interest, so we use a one-sample t-test. Example: test whether the mean is 4.

H₀: the mean is 4.
H₁: the sample mean is not equal to 4 (x̄ ≠ 4). (A one-sided alternative would be that the mean is greater than, or less than, 4.)

From the sample (n = 6): x̄ = 4.8333, s = 2.385, so SE = s/√n = 2.385/√6 ≈ 0.974.

t = (x̄ − 4) / SE = (4.8333 − 4) / 0.974 ≈ 0.856,  df = n − 1 = 5

Two-tailed p-value: 2 × P(t₅ > 0.856) ≈ 2 × 0.22 = 0.44. In R: 2 * pt(0.856, df = 5, lower.tail = FALSE).

95% CI: x̄ ± t_critical × SE = 4.83 ± 2.57 × 0.974 ≈ (2.3, 7.3), which does include 4.

Since 0.856 < 2.57 (the critical value qt(0.975, 5)), the CI includes 4, and p > 0.05, we do not have enough evidence to reject H₀.
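The same test in one call with t.test(); a minimal sketch (the data vector is made up to give n = 6, not the original notes' data):

x <- c(4, 7, 2, 6, 8, 2)   # made-up sample of n = 6 values
t.test(x, mu = 4)          # H0: mu = 4; reports t, df = 5, p-value and 95% CI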
Paired t-test

Paired means the data being compared are taken from the same individuals. In paired data the observations are paired, so the sample size is the same for both groups.

H₀: there is no difference in the mean between group A and B (µ_A − µ_B = 0).
H₁: there is a difference in the mean between group A and B.

s_D is the standard deviation of the differences. First calculate the difference for each pair, then proceed as a one-sample t-test on the differences. Example (n = 5 pairs): d̄ = 3.8, s_D = 2.109, so SE = s_D/√n = 2.109/√5 ≈ 0.94.

t = 3.8 / 0.94 ≈ 4.0,  df = n − 1 = 4,  p ≈ 0.009

95% CI: 3.8 ± 2.78 × 0.94 = (1.19, 6.41), where 2.78 = qt(0.975, 4).

Since the CI does not include zero and p < 0.05, we have enough evidence to reject the null hypothesis: the difference is significant.
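In R this is t.test() with paired = TRUE; a minimal sketch with made-up pairs:

A <- c(2, 5, 4, 6, 9)            # made-up measurements, first condition
B <- c(6, 9, 7, 11, 12)          # same individuals, second condition
t.test(B, A, paired = TRUE)      # identical to t.test(B - A, mu = 0)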
Unpaired t-test

Two samples, not from the same individuals, and the number of trials in each sample might not be the same. Example: do the people from school X have the same mean as the people from school Y?

H₀: µ_X = µ_Y
H₁: µ_X ≠ µ_Y

t = (x̄₁ − x̄₂) / SE,  SE = s_pooled × √(1/n₁ + 1/n₂),
s_pooled = √[ ((n₁−1)s₁² + (n₂−1)s₂²) / (n₁+n₂−2) ],  df = n₁ + n₂ − 2

Example (n₁ = 5, n₂ = 6, so df = 9): t ≈ 0.49, so the one-tail probability is pt(0.49, df = 9, lower.tail = FALSE) ≈ 0.3179, and the two-tailed p-value is 2 × 0.3179 = 0.6358. The 95% CI, (x̄₁ − x̄₂) ± 2.26 × SE ≈ (−5.8, 9.0), overlaps zero (2.26 = qt(0.975, 9)).

Since p = 0.64 > 0.05 and the CI overlaps zero, we do not have enough evidence to reject H₀.
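A minimal sketch with t.test() (made-up samples of unequal size, pooled variance as in the formula above):

x <- c(12, 15, 11, 14, 13)        # made-up sample from school X
y <- c(10, 13, 12, 9, 11, 12)     # made-up sample from school Y
t.test(x, y, var.equal = TRUE)    # pooled-variance t-test, df = n1 + n2 - 2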
Assignment 3_2024

Tara and Paul

2024-05-10

Instructions
• Assignment 3 contains 2 problems worth a total of 32 marks.

Pollinators dataset will be used to answer Problem 1. This dataset contains information on
two species of Sea Rocket Cakile maritima (Maritima) and Cakile edentula (Edentula), and
their first generation hybrid plants (F1 Hybrid). Cakile sp. is an invasive weed in Australia
and we are interested in investigating the interactions between these species and their insect
pollinators. We collected data on the number of flowers plants produced over time, and we are interested to see whether there is a correlation between the number of flowers produced by each plant and the number of pollinators visiting the plants.
The file name for this dataset is “pollinators.csv”. There is one row for each individual plant. Column Species
is the species of the plant; NumberofFlowers is the number of open flowers on each plant; NumberofPollinator
is the number of insect pollinators that visited that plant; Date is the day data were recorded.
Download and save the data in your Working Directory.
Flower <- read.csv("pollinators.csv")

Problem 1 [17 marks]


a. [2 Marks] Create two plots to visualise your data. First, plot boxplots of Number of Pollinator vs
Species (use different colour for species and label axes appropriately). Second, plot a scatter plot of the
relationship between the Number of Flowers and Number of Pollinator.
#box plot (1 mark)
boxplot(Flower$NumberofPollinator ~ Flower$Species,
col = rainbow(ncol(Flower)),
xlab = "Plant species", ylab = "Pollinator frequency")

[Figure: boxplots of pollinator frequency by plant species (Edentula, F1 Hybrid, Maritima)]
#scatter plot (1 mark)
plot(Flower$NumberofFlowers, Flower$NumberofPollinator, main = "Scatterplot")

[Figure: scatterplot of Flower$NumberofPollinator vs Flower$NumberofFlowers]
b. [1 Mark] Fit a linear model to the response variable (Number of Pollinator) and predictor (Number
of Flowers). Add the fitted regression line from this linear model to the scatter plot you made in part a.
pol_lm <- lm(NumberofPollinator ~ NumberofFlowers, data = Flower)
plot(Flower$NumberofFlowers, Flower$NumberofPollinator, main = "Scatterplot")
abline(pol_lm, col = "red")

[Figure: scatterplot of Flower$NumberofPollinator vs Flower$NumberofFlowers with the fitted regression line (red)]
c. [1 Mark] Is there a strong correlation between the number of flowers each plant produced and the number of pollinators that visited them? Why?
#Problem 1 _ Correlation and lm
cor(Flower$NumberofFlowers, Flower$NumberofPollinator)

## [1] 0.9137602
[ANSWER] Yes, there is a strong positive correlation between the number of flowers each plant produced and the number of pollinators that visited them: r = 0.914, which is close to 1 (so about 83% of the variation in pollinator numbers, r², is explained by flower number).
d. [2 Marks] We are interested to see whether there is an effect of flower number and plant species on
the number of pollinators visiting these plants. Write down the most appropriate statistical test to see
the effect of these two predictors on the number of pollinators. Clearly interpret the output of the
statistical test.
#Problem 1 _ ANCOVA
pollinator_lm <- lm(NumberofPollinator ~ Species + NumberofFlowers, data = Flower)
summary(pollinator_lm)

##
## Call:
## lm(formula = NumberofPollinator ~ Species + NumberofFlowers,
## data = Flower)
##
## Residuals:
## Min 1Q Median 3Q Max
## -34.574 -4.711 0.141 3.943 30.777
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)

## (Intercept) -3.94327 0.99138 -3.978 9.26e-05 ***
## SpeciesF1 Hybrid 9.54214 1.38448 6.892 4.97e-11 ***
## SpeciesMaritima 8.48277 1.36546 6.212 2.34e-09 ***
## NumberofFlowers 0.58455 0.01661 35.187 < 2e-16 ***
## ---
## Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
##
## Residual standard error: 8.473 on 236 degrees of freedom
## Multiple R-squared: 0.867, Adjusted R-squared: 0.8653
## F-statistic: 512.8 on 3 and 236 DF, p-value: < 2.2e-16
anova(pollinator_lm)

## Analysis of Variance Table


##
## Response: NumberofPollinator
## Df Sum Sq Mean Sq F value Pr(>F)
## Species 2 21556 10778 150.12 < 2.2e-16 ***
## NumberofFlowers 1 88896 88896 1238.15 < 2.2e-16 ***
## Residuals 236 16944 72
## ---
## Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
[ANSWER] The Analysis of Variance shows a significant effect of Species (plant type) and the number of flowers each plant produced on the number of pollinators foraging on each plant (P < 2.2e-16).
e. [2 Marks] Is there any significant interaction between species and flower number? Fit a model to test
the interaction between the predictors. Clearly interpret the output of the model.
#Problem 1 _ interaction
model_interaction <- lm(NumberofPollinator ~ Species * NumberofFlowers, data = Flower)
summary(model_interaction)

##
## Call:
## lm(formula = NumberofPollinator ~ Species * NumberofFlowers,
## data = Flower)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.539 -2.721 -1.373 2.909 27.883
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.72059 0.91475 2.974 0.00325 **
## SpeciesF1 Hybrid -0.39984 1.49349 -0.268 0.78915
## SpeciesMaritima -0.38259 1.29686 -0.295 0.76824
## NumberofFlowers 0.20565 0.03177 6.474 5.56e-10 ***
## SpeciesF1 Hybrid:NumberofFlowers 0.46382 0.03991 11.623 < 2e-16 ***
## SpeciesMaritima:NumberofFlowers 0.44469 0.03599 12.355 < 2e-16 ***
## ---
## Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
##

## Residual standard error: 6.478 on 234 degrees of freedom
## Multiple R-squared: 0.9229, Adjusted R-squared: 0.9213
## F-statistic: 560.3 on 5 and 234 DF, p-value: < 2.2e-16
anova(model_interaction)

## Analysis of Variance Table


##
## Response: NumberofPollinator
## Df Sum Sq Mean Sq F value Pr(>F)
## Species 2 21556 10778 256.813 < 2.2e-16 ***
## NumberofFlowers 1 88896 88896 2118.187 < 2.2e-16 ***
## Species:NumberofFlowers 2 7124 3562 84.871 < 2.2e-16 ***
## Residuals 234 9820 42
## ---
## Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
[ANSWER] Yes, the Analysis of Variance shows a significant effect of the interaction between plant species and number of flowers (Species:NumberofFlowers) on the number of pollinators foraging on each plant (P < 2.2e-16).
f. [2 Marks] Does your model (model in question 1e) meet the assumptions? Why?
#Problem 1 _ assumption check
plot(model_interaction, which = c(1,2))

[Figure: Residuals vs Fitted plot for lm(NumberofPollinator ~ Species * NumberofFlowers); points 211, 237, 213 flagged]
[Figure: Normal Q-Q plot of standardized residuals for lm(NumberofPollinator ~ Species * NumberofFlowers); points 211, 237, 226 flagged]
[ANSWER] The model appears to meet the assumption of normally distributed residuals. We can use the residuals vs fitted values plot and the Q-Q residuals plot to check the distribution of errors in a model. The residuals vs fitted plot shows roughly equal error variances, and the residuals seem to be randomly distributed. The Q-Q residuals plot shows the distribution of errors in the model; it looks almost normal, although it is somewhat skewed in both tails. Note: for this question, we would accept either answer (yes or no), as long as you explain correct reasoning.
g. [3 Marks] Use the function anova() in R to compare your models (models with and without interaction)
in part d and e using a likelihood ratio test (LRT). State the null hypothesis for this test, the distribution
for the test statistic under the null hypothesis (specify also the degrees of freedom), and give a p-value.
# Problem 1 _ Model comparison
anova(pollinator_lm, model_interaction, test = "LRT")

## Analysis of Variance Table


##
## Model 1: NumberofPollinator ~ Species + NumberofFlowers
## Model 2: NumberofPollinator ~ Species * NumberofFlowers
## Res.Df RSS Df Sum of Sq Pr(>Chi)
## 1 236 16944.2
## 2 234 9820.5 2 7123.7 < 2.2e-16 ***
## ---
## Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
[ANSWER] The Null Hypothesis: The simpler model with no interaction is sufficient, meaning the additional
parameters in the more complex model do not significantly improve the model fit (i.e., the interaction term
between Species and NumberofFlowers does not significantly improve the model fit). The Alternative
Hypothesis: The more complex model provides a significantly better fit to the data than the simpler model
(i.e., the interaction term significantly improves the model fit.). Under the null hypothesis, the test statistic
used in the Likelihood Ratio Test (LRT) follows a chi-squared distribution. The degrees of freedom for the
chi-squared distribution is equal to the difference in the number of parameters (or degrees of freedom) between
the two models. In the ANOVA table, this difference is 2 (df = 2), corresponding to the two additional

parameters (interaction terms) in Model 2 compared to Model 1. Given the very small p-value (< 2.2e-16),
we reject the null hypothesis. This means that adding the interaction term (Species * NumberofFlowers)
significantly improves the model fit. The interaction between Species and NumberofFlowers is important for
predicting the number of pollinators.
h. [4 Marks] We would like to know whether there is any significant difference between the number of
flowers each species produced or not. Run an appropriate statistical test to test if there is any significant
difference between the number of flowers each species (Maritima, Edentula and F1 Hybrid) produced.
Clearly interpret the output of the statistical test. Then, run a post hoc TUKEY test to see which two
species produced significantly higher number of flowers? To run a Tukey post hoc test, use TukeyHSD()
function to find confidence intervals for the difference between means for pairs of species, and use this
to see which two species produced statistically significant different number of flowers. Clearly interpret
the output of the post hoc test.
ANOVA_1 <- aov(NumberofFlowers ~ Species, data = Flower) #1mark
summary(ANOVA_1) #1 mark for interpretation of the summary

## Df Sum Sq Mean Sq F value Pr(>F)


## Species 2 19198 9599 8.745 0.000217 ***
## Residuals 237 260158 1098
## ---
## Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
TukeyHSD(ANOVA_1) #1 mark + 1 mark for interpretation

## Tukey multiple comparisons of means


## 95% family-wise confidence level
##
## Fit: aov(formula = NumberofFlowers ~ Species, data = Flower)
##
## $Species
## diff lwr upr p adj
## F1 Hybrid-Edentula 21.0125 8.657095 33.367905 0.0002385
## Maritima-Edentula 15.8750 3.519595 28.230405 0.0076188
## Maritima-F1 Hybrid -5.1375 -17.492905 7.217905 0.5897786
[ANSWER] The results of the Tukey post hoc test show that Edentula plants produced a significantly lower number of flowers compared to F1 Hybrid (P = 0.00023) and Maritima (P = 0.0076). However, there is no significant difference between the number of flowers produced by Maritima and F1 Hybrid plants (P = 0.5897).

Problem 2 [15 marks]


In this problem you will analyse data on the effect of a pesticide on aphid mortality (mort). The file name for this dataset is "aphids.csv". In this dataset, we tested the mortality rate of four different populations (pop) of aphids when they were exposed to different concentrations (conc) of a particular pesticide. We are interested to see whether any of these populations have developed resistance against this pesticide.
aphid <- read.csv("aphids.csv")

a. [3 Marks] Fit a linear model to test the difference between mean mortality of these aphid populations
when they were exposed to different concentrations of the pesticide. Check the residual (error)
distribution. Does your model provide a good fit to the data? Why or why not?
#lm with no transformation
Model1 <- lm( data= aphid, mort ~ conc + pop)
summary(Model1)

##
## Call:
## lm(formula = mort ~ conc + pop, data = aphid)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.44470 -0.21604 -0.04101 0.16218 0.57847
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.385419 0.044358 8.689 3.7e-15 ***
## conc 0.064452 0.006263 10.291 < 2e-16 ***
## popForbes5 0.052837 0.061136 0.864 0.389
## popKyabram98 0.049634 0.061136 0.812 0.418
## popOsborne171 -0.028344 0.061136 -0.464 0.644
## ---
## Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
##
## Residual standard error: 0.2802 on 163 degrees of freedom
## Multiple R-squared: 0.3994, Adjusted R-squared: 0.3847
## F-statistic: 27.1 on 4 and 163 DF, p-value: < 2.2e-16
plot(Model1, which = c(1,2))

[Figure: Residuals vs Fitted plot for lm(mort ~ conc + pop); points 158, 31, 34 flagged]
[Figure: Normal Q-Q plot of standardized residuals for lm(mort ~ conc + pop)]
[ANSWER] No, the model does not seem a good fit, as the Residuals vs Fitted plot and the Q-Q residuals plot do not show a normal distribution of errors.
b. [3 Marks] Fit another linear model and log transform the concentration of the pesticide in this model.
Check the residual (error) distribution. Does your model provide a good fit to the data? Why or why
not?
#lm with log transformation for pesticide concentration
Model2 <- lm( data = aphid, mort ~ I(log(conc)) + pop)
summary(Model2)

##
## Call:
## lm(formula = mort ~ I(log(conc)) + pop, data = aphid)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.70036 -0.07549 0.01228 0.09352 0.38801
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.807313 0.026846 30.072 <2e-16 ***
## I(log(conc)) 0.069398 0.002607 26.620 <2e-16 ***
## popForbes5 0.052837 0.033957 1.556 0.122
## popKyabram98 0.049634 0.033957 1.462 0.146
## popOsborne171 -0.028344 0.033957 -0.835 0.405
## ---
## Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
##
## Residual standard error: 0.1556 on 163 degrees of freedom
## Multiple R-squared: 0.8147, Adjusted R-squared: 0.8102
## F-statistic: 179.2 on 4 and 163 DF, p-value: < 2.2e-16

plot(Model2, which = c(1,2))

[Figure: Residuals vs Fitted plot for lm(mort ~ I(log(conc)) + pop); points 62, 26, 67 flagged]
[Figure: Normal Q-Q plot of standardized residuals for lm(mort ~ I(log(conc)) + pop)]
[ANSWER] Yes, the model appears to be a good fit, as both the Residuals vs Fitted plot and the Q-Q residuals plot show that the residuals follow a normal distribution after the log transformation. A log transformation can stabilise variance and make the distribution of residuals more normal. This transformation is particularly useful when dealing with skewed data or heteroscedasticity (non-constant variance). This means the model is more likely to meet the assumptions of linear regression, leading to more accurate predictions and valid statistical inferences.
c. [3 Marks] Fit a generalized linear model. What distribution family would you choose for this model? Why? Check the residual (error) distribution. How has the distribution of the error changed compared to the two previous models?
#glm
Model3 <- glm( data = aphid, mort ~ I(log(conc)) + pop, weights=total,
family = binomial(link = "logit"))
summary(Model3)

##
## Call:
## glm(formula = mort ~ I(log(conc)) + pop, family = binomial(link = "logit"),
## data = aphid, weights = total)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.71045 0.15503 11.033 <2e-16 ***
## I(log(conc)) 0.39314 0.01861 21.130 <2e-16 ***
## popForbes5 0.39088 0.18635 2.098 0.0359 *
## popKyabram98 0.35505 0.18656 1.903 0.0570 .
## popOsborne171 -0.17319 0.18355 -0.944 0.3454
## ---
## Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1020.8 on 167 degrees of freedom
## Residual deviance: 254.5 on 163 degrees of freedom
## AIC: 546.66
##
## Number of Fisher Scoring iterations: 4
plot(Model3, which = c(1,2))

[Figure: Residuals vs Fitted (Pearson residuals) plot for glm(mort ~ I(log(conc)) + pop); points 128, 67 flagged]
[Figure: Q-Q plot of standardized deviance residuals for glm(mort ~ I(log(conc)) + pop); points 67, 26, 62 flagged]
[ANSWER] The distribution is binomial, as we recorded the death or survival of aphids after exposure to different doses of pesticide.
d. [2 Marks] Which model was the most appropriate model to test whether there are differences between
mean mortality of the aphid populations? Explain why you chose this model. Explain and interpret
the summary of this model.

[ANSWER] Both models show similar normality in the Q-Q plots, indicating that the residuals are reasonably normally distributed for both models. This suggests that both models meet the normality assumption well. Therefore, both models are equally good. **If someone has chosen only one model, either of these reasonings is acceptable: The GLM is a better model since the data are binomial. The LM is a better model as it is a simpler model compared to the GLM.**
e. [4 Marks] Plot the scatter plot of response (Y axis) vs predictor (X axis), with groups (color and/or
different symbols may help here). Add the fitted regression lines from your model. Ensure your graph
is on the original scale and clearly labelled.
[ANSWER] We added the graph after log transformation too, so you can actually compare the graphs
before and after the transformation. Both graphs are acceptable. Note: you must color code the graph using
aphid populations because we want to see how populations’ mortality differs at each concentration level.
library(ggplot2)

ggplot(data= aphid, aes(x= conc, y= mort, col = pop)) +


geom_point() + geom_smooth( method = "glm", se = FALSE,
method.args = list(family = binomial))+
labs(x = "Pesticide concentration",
y = "Aphid mortality (%)", colour = "Aphid population") + theme_bw()

## geom_smooth() using formula = y ~ x


## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

[Figure: scatterplot of aphid mortality (%) vs pesticide concentration, coloured by aphid population (Boggabilla209, Forbes5, Kyabram98, Osborne171), with fitted GLM curves]
#log transformed scale
ggplot(data= aphid, aes(x= log(conc), y= mort, col = pop)) +
geom_point() + geom_smooth( method = "glm", se = FALSE,
method.args = list(family = binomial))+
labs(x = "log( pesticide concentration)",
y = "Aphid mortality (%)", colour = "Aphid population") + theme_bw()

## geom_smooth() using formula = y ~ x


## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!

## Warning in eval(family$initialize): non-integer #successes in a binomial glm!

## Warning in eval(family$initialize): non-integer #successes in a binomial glm!

[Figure: scatterplot of aphid mortality (%) vs log(pesticide concentration), coloured by aphid population, with fitted GLM curves]
Assignment 2_Solutions

TJ

2024-05-03

Problem 1: Dexterity Study Design [7 marks]


a. [2 marks] How to determine the minimum sample size? Explain how varying the number of
participants might influence the conclusions drawn from the study regarding hand dexterity differences
between dominant and non-dominant hands. Discuss the potential need for a minimum sample size to
achieve statistical significance.
ANSWER The reliability of study results often increases with a larger sample size. In the context of examining
dexterity differences between dominant and non-dominant hands, a larger sample size helps to reduce the
impact of outlier performances or individual variability in dexterity, thus providing a clearer picture of general
trends across the population. Larger samples also enhance the statistical power of the study, making it easier
to detect a true effect if one exists. Moreover, a larger sample size can help ensure the study population
is more representative of the general population, assuming the sample is randomly selected. If the sample
size is too small, the study may not accurately reflect the broader population, and the findings may not be
generalisable. A minimum sample size can be determined using power analysis, considering the expected
effect size, desired power level (commonly 0.8), and alpha level (commonly 0.05).
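For example, a minimal sketch of such a power analysis in R (the delta and sd values here are illustrative assumptions, not values from the study):

# Sample size to detect a mean paired difference of 2 grains (sd of the
# differences assumed to be 5) with power 0.8 at the 5% significance level:
power.t.test(delta = 2, sd = 5, sig.level = 0.05, power = 0.8, type = "paired")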
b. [3 marks] Improving data collection The instructions we gave you to collect data for Data Collection
Exercise 2 were deliberately brief. Re-write the instructions to improve the data collection, especially
improving the consistency between the experimental conditions every student used. You should make
at least three substantial changes to the instructions to improve consistency in data collection.
ANSWER You will be testing how many grains of rice you can place in a cup in 30 seconds. You must only move one grain of rice at a time. The test will be conducted with each hand exactly three times. You should use `runif(1)` to generate a random number; if it is less than 0.5 you should begin with your right hand, otherwise begin with your left. Then you should alternate hands. For each test, you will record two variables: which hand and the number of grains moved.
The following steps are recommended to perform the experiment accurately:
1. Wash and thoroughly dry your hands.
2. Find a cup with a diameter of approximately 7 cm.
3. Place the cup 20 cm from the hand to be tested (on the left of your hand if testing the right hand, or vice versa).
4. Place a small pile (approx. 80-100 grains) of rice where your hand rests.
5. Set up a timer for 30 seconds.
6. Start the timer with the hand not being tested, and move grains one at a time until the timer beeps.
7. Do not count any grain still in your hand, but otherwise count the number of grains in the cup.
c. [2 marks] Examine the potential role of age and physical activity level in dexterity studies.
Consider how differences in age and physical activity level could affect the results of this experiment.
To better understand the influence of age and physical activity level on dexterity outcomes, propose a
method that incorporates these variables into the study design. Note: Incorporate important elements
of a good experimental design into your suggested design.
ANSWER Age and level of physical activity are critical factors that can influence motor skills and dexterity.
Studies have shown that dexterity peaks at a certain age and declines thereafter. For example: developmental
changes or cognitive changes due to age can affect tasks requiring coordination and fine motor control.
Additionally, physical activity level might influence dexterity. For example, individuals who engage regularly
in activities requiring fine motor skills or hand-eye coordination (like playing musical instruments or sports)

might have better dexterity. Also, higher levels of physical activity are associated with better overall health,
including muscle strength and neurological health, which can positively impact motor skills. Including these
variables in the study design can provide deeper insights into how dexterity varies not only between dominant
and non-dominant hands but also across different demographic segments. To incorporate age and physical
activity, we can group (blocking) the sample to include a balanced mix of ages and level of physical activity
every individual does. This approach allows us to control for the effects of age and physical activity or to
explore how these factors interact with hand dominance in affecting dexterity.

Problem 2: Analysing Dexterity Differences [14 marks]


dce2 <- read.csv(file = "DCE22024.csv")

The following commands will reorganise the data in a way helpful to answering Problem 2.
# Calculate the average nGrains across replicates for each user, based on their "dominant hand" and "hand used"
dce2.agg <- aggregate(dce2, . ~ UserID+Hand+dominantHand, mean, na.action = na.omit)
# Select L and R trials and calculate differences
rTrials <- subset(dce2.agg, dce2.agg$Hand=="R")
rTrials <- rTrials[order(rTrials$UserID), ]
lTrials <- subset(dce2.agg, dce2.agg$Hand=="L")
lTrials <- lTrials[order(lTrials$UserID), ]
RL <- rTrials$nGrains-lTrials$nGrains # Calculates average difference (right-left)
dom.hand <- rTrials$dominantHand # Dominant hand of each person
# Creates a new data frame (dexterity.data) with the relevant variables calculated above
dexterity.data <- data.frame(user = rTrials$UserID, rHand = rTrials$nGrains,
lHand = lTrials$nGrains, difference = RL, dominantHand=dom.hand)

a. [3 marks] Use an appropriate plot to examine if the variable difference (calculated in the code above)
is normally distributed or not. Examine the difference separately for both right and left dominant hand
groups. (NOTE: You should make two plots and label them appropriately.)
ANSWER
# Subset
RDH <- dexterity.data[dexterity.data$dominantHand == "Right",]
LDH <- dexterity.data[dexterity.data$dominantHand == "Left",]

hist(RDH$difference)

[Figure: histogram of RDH$difference]
hist(LDH$difference)

[Figure: histogram of LDH$difference]
par(mfrow=c(1,2))
qqnorm(RDH$difference, main="QQ-plot of difference \n Right-dominant individuals")
qqline(RDH$difference, col="deeppink3")

qqnorm(LDH$difference, main="QQ-plot of difference \n Left-dominant individuals")
qqline(LDH$difference, col="green")

[Figure: normal Q-Q plots of difference for right-dominant and left-dominant individuals]


Both Right-hand dominant and Left-hand dominant individuals have clear non-normality in these plots. Both
show a strong curve and seem to be skewed.
b. [6 marks] Considering only the students who are Right-hand dominant, carry out a suitable
hypothesis test at the 0.05 level of significance to test if there is a difference between their right and left
hands. You should clearly state your hypotheses, test statistic, degrees of freedom (if relevant), p-value
and a precise conclusion in the context of the question.
ANSWER
# Can do manually:
# Manually doing the T test
x.bar <- mean(RDH$difference)
n <- length(RDH$difference)
SE <- sd(RDH$difference) / sqrt(n)

test.stat <- (x.bar - 0) / SE


df <- n - 1
c(test.stat, df) # Just so the test statistic and df are displayed

## [1] 8.534635 97.000000


pt(test.stat, df = df, lower.tail = FALSE)*2

## [1] 1.936863e-13
# These two commands are doing exactly the same thing. So, either is fine to run a paired t-test!
t.test(RDH$rHand, RDH$lHand, paired = TRUE)

##
## Paired t-test
##
## data: RDH$rHand and RDH$lHand
## t = 8.5346, df = 97, p-value = 1.937e-13
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
## 3.351724 5.382970
## sample estimates:
## mean difference
## 4.367347
t.test(RDH$difference)

##
## One Sample t-test
##
## data: RDH$difference
## t = 8.5346, df = 97, p-value = 1.937e-13
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 3.351724 5.382970
## sample estimates:
## mean of x
## 4.367347

H0: µdiff = 0
HA: µdiff ≠ 0
Null hypothesis: there is no difference in the mean number of grain between the right and left hands among
right-hand dominant individuals. Alternative hypothesis: there is a difference in the mean measurement
between the right and left hands among right-hand dominant individuals. There is strong evidence of a
difference in rice grain measurements of dexterity in Right-hand dominant individuals (*t(df=97) = 8.53, p=
1.937e-13*). On average, students moved between 3.35 and 5.38 more grains with their right hand.
c. [2 marks] Explain why it is reasonable to conduct a hypothesis test (t-test) for the mean difference for
Right-hand dominant individuals, even if they are not normal.
ANSWER The Central Limit Theorem (CLT) will apply since the sample size is large (n = 98). The mean of
the differences will be approximately normally distributed in this case.
d. [3 marks] Conduct a suitable hypothesis test for the students who are left-hand dominant. In
your analysis, carefully justify your choice of test, including a detailed examination of the assumptions
required for the selected test. After conducting the test, you need to write a clear conclusion that is
supported by relevant evidence. This conclusion should summarise the findings and explain how the
evidence supports these outcomes.
ANSWER As the data for left-hand dominant individuals is not normally distributed and the sample size is
small, we can use a non-parametric test to compare the difference.
#non-parametric test to compare mean difference
wilcox.test(LDH$difference)

##
## Wilcoxon signed rank exact test
##
## data: LDH$difference

## V = 1, p-value = 0.0009766
## alternative hypothesis: true location is not equal to 0
Given the non-normality in the small sample of left-hand dominant individuals, a non-parametric test was
considered appropriate. Left-hand dominant individuals showed strong evidence of differences in the grains
moved by each hand (Wilcoxon test V=1, P=0.00098), moving more grains with their left hand.

Problem 3: Testing DCE1 Data [9 marks]


In this problem you will use the ‘demographics’ data that we collected from each of you right at the start of
semester (DCE1). The data set has 179 rows; each row contains answers to each question from one student.
The file name for this dataset is “DCE12024.csv”. There is one row for each student that responded to the
quiz: PredictFinal is the mark the student nominated at the beginning of the course as their likely mark;
Languages is the number of languages the student speaks; DominantHand is the student’s dominant hand.
Student is whether students are domestic or international based on their country of origin.
demographics <- read.csv(file = "DCE12024.csv")

a. [5 marks] We are interested in whether there is an association between the number of languages spoken
and where people are from (metropolitan or regional). Treating Languages as a categorical variable,
carry out an appropriate hypothesis test. You will need to ensure that you meet the assumptions for
this test. In your answer:
• state your null hypothesis and your alternative hypothesis;
• list the assumptions and show they are met;
• state the test statistic, its distribution, and report the p-value
• describe in plain English what the results mean.
ANSWER
# extract relevant vectors
Location.Languages <- demographics[,c(3,7)]

# Convert into a matrix


tab1 <- table(Location.Languages)

# Find totals first:


TotRow <- rowSums(tab1)
TotCol <- colSums(tab1)
GrandTot <- sum(tab1)

# Expected values
#Metropolitan
TotCol*(TotRow[1])/GrandTot

## 1 2 3 4
## 72.4692737 40.9608939 26.7821229 0.7877095
#Regional
TotCol*(TotRow[2])/GrandTot

## 1 2 3 4
## 19.5307263 11.0391061 7.2178771 0.2122905
## OR you can use the shortcut: this will warn us of the issue.
chisq.test(tab1)

## Warning in chisq.test(tab1): Chi-squared approximation may be incorrect


##

## Pearson's Chi-squared test
##
## data: tab1
## X-squared = 1.7909, df = 3, p-value = 0.6169
This warns us because we have failed the assumption for the χ² test. Essentially, we need enough data, so we want 80% of expected values > 5 and all expected values > 1. So, we'll change to "3 or more languages", which should fix it.
Location.Languages.Modified <- Location.Languages
Location.Languages.Modified["Languages"][Location.Languages.Modified["Languages"] >= 3] <- 3

# Convert into a matrix


tab2 <- table(Location.Languages.Modified)

# Find totals
TotRow2 <- rowSums(tab2)
TotCol2 <- colSums(tab2)
GrandTot2 <- sum(tab2)

# Find Expected values to check assumption again:


#Metropolitan
TotCol2*(TotRow2[1])/GrandTot2

## 1 2 3
## 72.46927 40.96089 27.56983
#Regional
TotCol2*(TotRow2[2])/GrandTot2

## 1 2 3
## 19.530726 11.039106 7.430168
## Now just do the test:
chisq.test(tab2)

##
## Pearson's Chi-squared test
##
## data: tab2
## X-squared = 1.61, df = 2, p-value = 0.4471

H0 : no association between where students are from and number of languages spoken
H1 : there is an association between where students are from and number of languages spoken

The only assumption is that there is sufficient data, which we have already detailed above and shown it is
now satisfied by the second test. Based on this sample, no association was found between where students are
from (metropolitan/regional) and the number of languages spoken (χ²(df = 2) = 1.61, p = 0.447).
b. [4 marks] We are interested in whether there is a difference between the domestic and international students' predicted final marks. Run an analysis of variance to test whether there is a significant difference between the mean predicted final mark for international and domestic students. Use the PredictFinal and Student columns in the DCE12024 data set to run this analysis. Include your code and the outputs.
(i) What were the hypotheses being tested here?
(ii) Explain (in detail) what these results indicate.

ANSWER Can do this with either a t-test or a linear model, depending on whether they meet the assumptions (a linear model requires equal variances; this is optional for a t-test).
#Using a t-test

t.test(PredictFinal~Student, data=demographics, var.equal=TRUE)

##
## Two Sample t-test
##
## data: PredictFinal by Student
## t = -5.3106, df = 177, p-value = 3.248e-07
## alternative hypothesis: true difference in means between group Domestic and group International is not equal to 0
## 95 percent confidence interval:
## -8.991963 -4.119627
## sample estimates:
## mean in group Domestic mean in group International
## 72.79508 79.35088
#Using a linear model
demo.model <- lm(PredictFinal~Student, data=demographics)

par(mfrow=c(1,2))
plot(demo.model,which=1)
plot(demo.model,which=2)

[Figure: Residuals vs Fitted and normal Q-Q plots for demo.model; points 110, 95, 67 flagged]


summary(demo.model)

##
## Call:
## lm(formula = PredictFinal ~ Student, data = demographics)
##
## Residuals:

## Min 1Q Median 3Q Max
## -22.7951 -4.3509 0.6491 5.6491 19.6491
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 72.7951 0.6966 104.499 < 2e-16 ***
## StudentInternational 6.5558 1.2345 5.311 3.25e-07 ***
## ---
## Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
##
## Residual standard error: 7.694 on 177 degrees of freedom
## Multiple R-squared: 0.1374, Adjusted R-squared: 0.1326
## F-statistic: 28.2 on 1 and 177 DF, p-value: 3.248e-07
#You can also get an anova table for the model
anova(demo.model)

## Analysis of Variance Table


##
## Response: PredictFinal
## Df Sum Sq Mean Sq F value Pr(>F)
## Student 1 1669.7 1669.7 28.203 3.248e-07 ***
## Residuals 177 10478.9 59.2
## ---
## Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1

MAST20031 Analysis of Biological Data - Assignment 1
Solutions
Tara and Paul


Problem 0 [1 mark]
Participate in the data collection exercise. This was automatically recorded.
d <- read.csv("DCE2_2024.csv")

Problem 1 [14 marks]


a. [2 marks] Compute and report numerical summaries on G (each summary
containing min, max, 1st, 2nd and 3rd quartiles, mean and standard deviation)
for each hand (left and right), separated by the handedness of the student.
There are many ways to create the subsets in this question that would result in the same
five number summaries.
# First separate by handedness:

# Option 1: use subset() function


left <- subset(d, d$dominantHand == "Left")
right <- subset(d, d$dominantHand == "Right")

# Option 2: use indexing

# NOTE: There are many other ways to do this, e.g. using na.omit()
# The summaries would also be correct even if the NA values were left in,
# but there would also be an "NA" column.
left <- d[d$dominantHand=="Left" & !is.na(d$dominantHand), ]
right <- d[d$dominantHand=="Right" & !is.na(d$dominantHand), ]

# Dominant hand = left


summary(left[left$Hand=="L","nGrains"])

## Min. 1st Qu. Median Mean 3rd Qu. Max.


## 16.00 23.00 25.00 27.28 31.50 46.00
summary(left[left$Hand=="R","nGrains"])

## Min. 1st Qu. Median Mean 3rd Qu. Max.


## 14.00 19.00 22.00 22.15 25.00 31.00

# Dominant hand = right


summary(right[right$Hand=="L","nGrains"])

## Min. 1st Qu. Median Mean 3rd Qu. Max.


## 5.00 18.00 21.00 21.58 24.00 42.00

summary(right[right$Hand=="R","nGrains"])

## Min. 1st Qu. Median Mean 3rd Qu. Max.


## 14.00 21.00 24.00 25.81 30.00 50.00

b. [3 marks] For Left-dominant individuals, produce and report a paired box plot
(two boxplots within a single graph) for G vs Hand. Also make one of these paired
plots for Right-dominant individuals. Label the graphs appropriately.
par(mfrow=c(2,1))

boxplot(left$nGrains~left$Hand,
        ylab="Hand used", xlab="Number of Grains",
        main="Grains of rice moved by left-handed individuals",
        ylim=c(0,50), horizontal=TRUE)

boxplot(right$nGrains~right$Hand,
        ylab="Hand used", xlab="Number of Grains",
        main="Grains of rice moved by right-handed individuals",
        ylim=c(0,50), horizontal=TRUE)
c. [4 marks] Based on (a) and (b), briefly comment on differences or similarities
in terms of centre and spread of the distributions of number of grains for
dominant hand compared to non-dominant hand. Provide a biological
justification to your findings.
With some skew and some outliers, the median/IQR are better measures of centre/spread.
Even if a few groups seem symmetric, it doesn’t make sense to compare a standard
deviation to an IQR, for example.

ANSWER
For left handed people, it seems like the left hand is slightly better at manipulating the rice
grains (Median: L=25, R=22). Similarly for the right handed people, the dominant hand has
a higher typical value (Median: L=21, R=24).
There is a wider spread of data in the left-hand trials than the right-hand trials for the left
handed people (IQR: L=8.5, R=6). Right-handed people also showed more variability in
their dominant hand (IQR: L=6, R=9). Overall the variability is higher in right-handed
people than in left-handed people.
Possible Biological explanations:
• People use their dominant hand more frequently, and this may lead to greater
dexterity (consistent with slight difference between the dominant and non-
dominant hands in both groups)
……….

d. [2 marks] Potential outliers are observations that fall outside some pre-determined "fences", typically 1.5 × IQR beyond Q1 and Q3. The number of outliers in these data is given below. Give a possible biological explanation for the pattern in the observed outliers.

                             Number of low outliers   Number of high outliers
Left hand dominant   Left    0                        1
                     Right   0                        0
Right hand dominant  Left    2                        11
                     Right   0                        8

ANSWER
Most outliers occurred in the right-handed group, and most were unusually high values. A possible biological explanation could be that some right-handed people are equally dexterous in both hands (e.g. because they routinely perform/practice tasks which require both hands, such as playing the piano). Moreover, some right-handed people might be extremely bad at using their non-dominant (left) hand, so we also got some very low outliers in the right-handed group when they used their non-dominant hand.
……….

e. [3 marks] Use the qqnorm and qqline commands to produce normal QQ-plots
for G, for each combination of hand and dominant hand (a total of four plots).
Does the normal distribution seem a good statistical model for these data? Why
or why not?
ANSWER
# Step 1. Name the subsets
LeftyL <- left[left$Hand=="L","nGrains"]
LeftyR <- left[left$Hand=="R","nGrains"]

RightyL <- right[right$Hand=="L","nGrains"]


RightyR <- right[right$Hand=="R","nGrains"]

par(mfrow=c(2,2))

# Left hand dominant: Left hand


qqnorm(LeftyL, main = "Normal QQ (Left dominant: Left hand)")
qqline(LeftyL)

# Left hand dominant: Right hand


qqnorm(LeftyR, main = "Normal QQ (Left dominant: Right hand)")
qqline(LeftyR)

# Right hand dominant: Left hand


qqnorm(RightyL, main = "Normal QQ (Right dominant: Left hand)")
qqline(RightyL)

# Right hand dominant: Right hand


qqnorm(RightyR, main = "Normal QQ (Right dominant: Right hand)")
qqline(RightyR)
The normal distribution only appears to be a reasonable fit for the Left-hand dominant: right hand group. The left-hand dominant: left hand subset does not appear normal, with a few deviations throughout the curve in the QQ plot. Both right-hand dominant groups appear to be non-normal due to a deviation in the upper tail.
……….

Problem 2 [6 marks]
a. [3 marks] Consider just the left-handed individuals. We want to estimate the
population standard deviation in the number of grains (overall, ignoring which
hand was being used). Write some code to calculate a 95% bootstrap confidence
interval for the population standard deviation σ. Add comments to explain what
your code is doing; you will be assessed on both your code and your comments
NOTE: you must write some code, not use other functions and/or packages to do this.
ANSWER
original.data <- left$nGrains

# Bootstrap to determine bCI for sigma (standard deviation)

i <- 1:1000          # We'll do this 1000 times
bootstrap <- c()     # Start with no bootstrap sample values
for (val in i) {     # Keep doing this
  # Sample with replacement, same sample size as original data
  x <- sample(original.data, size=length(original.data), replace = TRUE)
  # Calculate sample standard deviation and add to the list of bootstrap values
  bootstrap <- c(bootstrap, round(sd(x),2))
}
quantile(bootstrap,0.025) # Lower endpoint (0.025 quantile for 95% interval)
quantile(bootstrap,0.975) # Upper endpoint (0.975 quantile for 95% interval)

## 2.5%
## 5.12975
## 97.5%
## 7.8205

……….

b. [3 marks] You are conducting an experiment on bacterial growth, and have


measured 80 petri dishes. You are interested in the area covered by the
bacterial growth, which does not appear to be normally distributed. For which
of the following parameters would it be useful to calculate a bootstrap
Confidence Interval (bCI)? You will be assessed on your reasons, not just the
answers.
i. Population mean, µ.

ii. Population median.

iii. Population standard deviation, σ.

ANSWER
It makes sense to calculate a bCI for the median and standard deviation, because these
don’t follow normal distributions, typically. You could calculate one for the mean, too, as
bootstrap intervals are always an option. However, if there is a large sample size (applies to
this problem with n = 80), then even if the original data are not normal then the central
limit theorem (CLT) means that a standard CI for the mean µ will be appropriate, and hence
better than a bCI.
……….

Problem 3 [6 marks]
In the final problem we will examine whether there is a learning effect in our data - do
students get better at this task with each trial?

a. [1 mark] Plot four histograms: the number of grains across all trials, and
separately for each trial. Use the arguments xlim=c(0, 50), breaks=25 for all
plots.

ANSWER
First create the subsets
TrialAll <- d$nGrains
Trial1 <- d$nGrains[d$Replicate==1]
Trial2 <- d$nGrains[d$Replicate==2]
Trial3 <- d$nGrains[d$Replicate==3]

Now create the plots


par(mfrow=c(2,2))

hist(TrialAll,
xlim=c(0, 50), breaks=25,
main="Histogram for all trials", xlab="Number of Grains")

hist(Trial1,
xlim=c(0, 50), breaks=25,
main="Histogram for Trial 1", xlab="Number of Grains")

hist(Trial2,
xlim=c(0, 50), breaks=25,
main="Histogram for Trial 2", xlab="Number of Grains")

hist(Trial3,
xlim=c(0, 50), breaks=25,
main="Histogram for Trial 3", xlab="Number of Grains")
……….

b. [2 marks] Compute the sample mean, x̄, and sample standard deviation, s, for the number of grains in the entire dataset (702 rows) and for the first, second and third trial separately. Briefly compare and contrast the findings from the original and separated data.

ANSWER

         Mean    SD
Overall  23.802  6.844
Trial 1  22.674  6.607
Trial 2  23.831  6.738
Trial 3  24.901  7.027

The distribution of the data is quite similar in all cases, with only slight differences in the
means and standard deviations.
The mean increases by approx. 1 grain on each trial (T1=22.7, T2=23.8, T3=24.9). However,
the standard deviation also increases very slightly with the trial.
In order to determine if there is a learning effect, it would be necessary to evaluate if these
differences between trials are significant.
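The table values above can be reproduced from the subsets created in part (a); a minimal sketch:

sapply(list(Overall = TrialAll, Trial1 = Trial1,
            Trial2 = Trial2, Trial3 = Trial3),
       function(g) round(c(Mean = mean(g), SD = sd(g)), 3))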
……….

c. [3 marks] Compute the 95% confidence interval across the entire dataset and
for each trial separately. Does it seem like students are becoming better at this
task with each trial? Why or why not?

ANSWER
Although there is some evidence of positive skew, because the sample size is large the
central limit theorem applies and we can assume normality for calculating these intervals.
However, since we do not know the standard deviation of the population, it is best to use the t
distribution.
CI = x̄ ± t × SE

All
SE_All = sd(TrialAll)/sqrt(length(TrialAll))
tCrit_All = c(-1,1) * qt(0.975,(length(TrialAll)-1))

CI_All = mean(TrialAll) + tCrit_All*SE_All


CI_All

## [1] 23.30295 24.30036

Trial1
SE_T1 = sd(Trial1)/sqrt(length(Trial1))
tCrit_T1 = c(-1,1) * qt(0.975,(length(Trial1)-1))

CI_T1 = mean(Trial1) + tCrit_T1*SE_T1


CI_T1

## [1] 21.83697 23.51014

Trial2
SE_T2 = sd(Trial2)/sqrt(length(Trial2))
tCrit_T2 = c(-1,1) * qt(0.975,(length(Trial2)-1))

CI_T2 = mean(Trial2) + tCrit_T2*SE_T2


CI_T2

## [1] 22.97735 24.68381


Trial3
SE_T3 = sd(Trial3)/sqrt(length(Trial3))
tCrit_T3 = c(-1,1) * qt(0.975,(length(Trial3)-1))

CI_T3 = mean(Trial3) + tCrit_T3*SE_T3


CI_T3

## [1] 24.01097 25.79069

At first glance it seems like students are getting better, since the interval gradually moves towards larger numbers, suggesting that more grains were handled during the same amount of time. However, all these intervals overlap, so we cannot be sure that the true population means for each trial are different. This analysis does not provide enough evidence to support the claim that there is a learning effect (students getting better at this task with each trial).
……….
