DH302 Spring2025 Assignment02-Solutions
Solutions
Instructions
Submit your solutions via Gradescope by 11:59 PM (IST) on Tuesday, 18th February 2025. In-person submissions will not be entertained. Please upload a single PDF file. Late submissions
are allowed with a 10% per day penalty (only up to 22nd February). You can raise your
questions related to the assignment on Piazza - please tag these as assignment_02.
• For theory questions, you can either write your response in LaTeX or put a
screenshot/picture of your handwritten response in the appropriate section. To
embed scanned images, use this format: 
where /path/to/question1.png is the local path (on your laptop) to your scanned
(handwritten) response to question1.
• If you are writing the solutions for theory questions by hand, please use a pen. Pencil
submissions are difficult to read when scanned. You will have to scan each
such answer and embed it in this document.
• Your final submission has to be the PDF that comes from this template - one single pdf.
No Exceptions.
• Please mention the name(s) of people you collaborated with and what exactly you dis-
cussed.
Making your submission: The raw template is available here and an up-to-date version of
this PDF is available here. Open the template in RStudio (you will need to ensure Quarto is
installed). Once you are done with your answers, use the “render” (arrow-like) button on the
toolbar to create a PDF. Only PDF submissions are allowed.
Problem 01 [25 points]
Quality of life: Improvement in quality of life was measured for a group of heart disease
patients after 8 weeks in an exercise program. This experimental group was compared to a
control group who were not in the exercise program. Quality of life was measured using a
21-item questionnaire, with scores of 1–5 on each item. The improvement data are as follows
and are plotted below.
1b. What conclusion could you draw from the dotplot? [2.5 points]
Quality of life scores for the exercise group are higher than those for the control group.
1c. Here is computer output for a t test. Explain what the P-value means in the context
of this study. [5 points]
t = 2.505, df = 33.23, p-value = 0.00866
alternative hypothesis: true difference in means is greater than 0
The p-value is the probability of observing a test statistic at least as extreme as the one seen
(2.505 or higher) if the null hypothesis is true. In this case, the p-value is 0.00866, which is
less than 0.05, so we reject the null. The alternative is that the true difference in means is
greater than 0, i.e. that quality of life improves more in the exercise group.
Full points only if the direction of the statistic is discussed and the conclusion is stated in the
context of the study.
1d. If type-1 error 𝛼 = 0.01, what is your conclusion regarding 𝐻0 ? State your conclusion
in the specific context of this problem. [5 points]
We reject H0 since the p-value (0.00866) is smaller than 𝛼 = 0.01.
1e. The computer output in part (c) is for the directional test. What is the P-value for
the nondirectional test? [5 points]
The p-value for the nondirectional test is 0.00866*2 = 0.01732
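This doubling can be checked directly from the t statistic and degrees of freedom reported in 1c:

```r
# One-sided p-value recomputed from the reported t statistic and df,
# then doubled for the nondirectional (two-sided) test.
p_one_sided <- pt(2.505, df = 33.23, lower.tail = FALSE)
p_two_sided <- 2 * p_one_sided
round(c(one_sided = p_one_sided, two_sided = p_two_sided), 5)
```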
1f. If the test were nondirectional, and 𝛼= 0.01, what conclusions would we make? [5
points]
We would fail to reject the null hypothesis, since the nondirectional p-value (0.01732) exceeds 𝛼 = 0.01.
Normality goes for a toss: Researchers took skin samples from 10 patients who had breast
implants and from a control group of 6 patients. They recorded the level of interleukin-6 or
IL6 in picogram/ml/10 g of tissue, a measure of tissue inflammation, after each tissue sample
was cultured for 24 hours. The dataset is available below (in R)
il6.breast.implant.patients <- c(
231, 308287, 33291, 124550, 17075,
22955, 95102, 5649, 840585, 58924
)
il6.control.patients <- c(35324, 12457, 8276, 44, 278, 840)
df.breast <- data.frame(value = il6.breast.implant.patients)
df.contorl <- data.frame(value = il6.control.patients)
library(ggridges)
library(patchwork)
library(tidyverse)
theme_set(ggpubr::theme_pubr())
df <- bind_rows(list(
`Breast implant` = df.breast,
`Control` = df.contorl
), .id = "group")
p1 | p2 | p3
2b. Draw a Q-Q plot for both the measurements [5 points]
You can use the geom_qq function
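The hidden chunk presumably builds implant.qqplot and contro.qqplot with geom_qq (combined below with patchwork). A self-contained base-R sketch of the same Q-Q diagnostic, with the data repeated from above, would be:

```r
# Base-R Q-Q plots for both groups; data as defined earlier in the problem.
il6.breast.implant.patients <- c(
  231, 308287, 33291, 124550, 17075,
  22955, 95102, 5649, 840585, 58924
)
il6.control.patients <- c(35324, 12457, 8276, 44, 278, 840)
par(mfrow = c(1, 2))
qqnorm(il6.breast.implant.patients, main = "Breast implant")
qqline(il6.breast.implant.patients)
qqnorm(il6.control.patients, main = "Control")
qqline(il6.control.patients)
```

The strong curvature away from the reference line in both panels shows the data are far from normal.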
# !! DO NOT EDIT/REMOVE !!
library(patchwork)
implant.qqplot | contro.qqplot
Sneak peek: I perform a t test of the null hypothesis that two means are equal. I decided to
calculate the means and then choose an alternative hypothesis 𝐻𝐴 𝜇1 > 𝜇2 because I observed
𝑦1̄ > 𝑦2̄ .
3a. Explain what is wrong (if anything) with this procedure and why it is wrong (if
anything). [5 points]
The null and alternative hypotheses should be decided before the data are collected. Choosing
the alternative hypothesis after looking at the data is data ‘snooping’ and biases the results
(a form of p-hacking).
3b. Suppose I reported t = 1.97 on 25 degrees of freedom and a P-value of 0.03. What
is the proper P-value? [5 points]
The proper p-value is 2 × 0.03 = 0.06. Because the direction of the alternative was chosen
after seeing the data, the reported one-sided p-value overstates the evidence; the honest
procedure is the nondirectional test, which doubles the p-value.
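The one- and two-sided p-values for t = 1.97 on 25 degrees of freedom can be checked in R:

```r
# One-sided and two-sided p-values for the reported statistic.
p_one <- pt(1.97, df = 25, lower.tail = FALSE)
p_two <- 2 * p_one
round(c(one_sided = p_one, two_sided = p_two), 3)
```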
4a. The defaults. Simulate two normal samples (n=200) with each mean = 10 and sd =
1, 10000 times. Apply a t-test for each iteration and calculate the p-value. With
𝛼 = 0.05, how many times do you reject the null hypothesis that the mean of two
samples is equal? What do you conclude? [2.5 points]
# !! DO NOT EDIT/REMOVE !!
# OUTPUT: n_rejects -- Number of rejections for alpha=0.05
set.seed(42)
alpha <- 0.05
n_rejections <- 0
N_tries <- 10000
n_sample_size <- 200
# (loop below reconstructed to mirror the 4b chunk; the rendered PDF omitted it)
for (i in seq(1, N_tries)) {
sample1 <- rnorm(n = n_sample_size, mean = 10, sd = 1)
sample2 <- rnorm(n = n_sample_size, mean = 10, sd = 1)
pval <- t.test(sample1, sample2)$p.value
if (pval <= alpha) {
n_rejections <- n_rejections + 1
}
}
n_rejections / N_tries
[1] 0.0511
n_rejections/N_tries: 0.0511
Conclusion: We expect a type I error rate of 0.05, which is what we observe: the null hypothesis
is rejected about 5% of the time even though it is true.
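As a sanity check on the simulation itself: with 10000 tries at a true level of 0.05, the Monte Carlo standard error of the estimated rejection rate is about 0.002, so 0.0511 is well within noise of the nominal 0.05.

```r
# Monte Carlo standard error of the estimated type I error rate
# with 10000 replicates at a true level of 0.05.
sqrt(0.05 * 0.95 / 10000)
```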
4b. Unequal variance. Simulate two normal samples (n=100) with each mean = 10 and
sd1 = 1 and sd2=2.5, 10000 times. Apply a t-test for each iteration and calculate the
p-value. With 𝛼 = 0.05, how many times do you reject the null hypothesis that the mean
of two samples is equal? What do you conclude and is that unusual? Use the default
t.test [2.5 points]
set.seed(42)
alpha <- 0.05
n_rejections <- 0
N_tries <- 10000
n_sample_size <- 200
mean1 <- 10
mean2 <- mean1
sd1 <- 1
sd2 <- 2.5
for (i in seq(1, N_tries)) {
sample1 <- rnorm(n = n_sample_size, mean = mean1, sd = sd1)
sample2 <- rnorm(n = n_sample_size, mean = mean2, sd = sd2)
pval <- t.test(sample1, sample2)$p.value
if (pval <= alpha) {
n_rejections <- n_rejections + 1
}
}
n_rejections / N_tries
[1] 0.0519
n_rejections/N_tries: 0.0519
Conclusion: Even though the equal variance assumption is violated, the type I error rate
is still close to 0.05. One reason is that t.test() by default uses
var.equal=FALSE (the Welch test).
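This default can be verified directly from the function's formal arguments:

```r
# t.test() dispatches to stats:::t.test.default; its var.equal
# argument defaults to FALSE, i.e. the Welch (unequal-variance) test.
formals(stats:::t.test.default)$var.equal
```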
4c. Unequal variance revisited: Repeat the example in 4b, now forcing the pooled test
with t.test(var.equal=TRUE) instead of the default var.equal=FALSE. With 𝛼 = 0.05, how
many times do you reject the null hypothesis that the mean of two samples is equal?
What do you conclude? [2.5 points]
set.seed(42)
alpha <- 0.05
n_rejections <- 0
N_tries <- 10000
n_sample_size <- 200
mean1 <- 10
mean2 <- mean1
sd1 <- 1
sd2 <- 2.5
for (i in seq(1, N_tries)) {
sample1 <- rnorm(n = n_sample_size, mean = mean1, sd = sd1)
sample2 <- rnorm(n = n_sample_size, mean = mean2, sd = sd2)
pval <- t.test(sample1, sample2, var.equal = TRUE)$p.value
if (pval <= alpha) {
n_rejections <- n_rejections + 1
}
}
n_rejections / N_tries
[1] 0.0523
n_rejections/N_tries: 0.0523
Conclusion: The type I error rate is still close to 0.05 despite forcing the equal variance
assumption. This is because the two sample sizes are equal and large, which makes the pooled
t-test robust to unequal variances.
4d. Severe violation: Following 4c, now set sd1=10, sd2=1, and use a non-Welch t-test to
tabulate the number of times you reject the null. With 𝛼 = 0.05, how many times do you
reject the null hypothesis that the mean of two samples is equal? What do you conclude?
[2.5 points]
# OUTPUT: n_rejects -- Number of rejections for alpha=0.05
set.seed(42)
alpha <- 0.05
n_rejections <- 0
N_tries <- 10000
n_sample_size <- 200
mean1 <- 10
mean2 <- mean1
sd1 <- 10
sd2 <- 1
for (i in seq(1, N_tries)) {
sample1 <- rnorm(n = n_sample_size, mean = mean1, sd = sd1)
sample2 <- rnorm(n = n_sample_size, mean = mean2, sd = sd2)
pval <- t.test(sample1, sample2, var.equal = TRUE)$p.value
if (pval <= alpha) {
n_rejections <- n_rejections + 1
}
}
n_rejections / N_tries
[1] 0.0527
n_rejections/N_tries: 0.0527
Conclusion: The type I error rate is still close to 0.05 despite forcing the equal variance
assumption with a much larger variance gap. With equal sample sizes the pooled and Welch
statistics nearly coincide, so the test remains robust.
4e. Severe violation 2: Following 4d, now simulate different sample sizes with
n_sample_size1=30 and n_sample_size2=70, sd1=10, sd2=1, and use a non-Welch
t-test to tabulate the number of times you reject the null. With 𝛼 = 0.05, how many
times do you reject the null hypothesis that the mean of two samples is equal? What do
you conclude? [2.5 points]
set.seed(42)
alpha <- 0.05
n_rejections <- 0
N_tries <- 10000
n_sample_size1 <- 30
n_sample_size2 <- 70
mean1 <- 10
mean2 <- mean1
sd1 <- 10
sd2 <- 1
for (i in seq(1, N_tries)) {
sample1 <- rnorm(n = n_sample_size1, mean = mean1, sd = sd1)
sample2 <- rnorm(n = n_sample_size2, mean = mean2, sd = sd2)
pval <- t.test(sample1, sample2, var.equal = TRUE)$p.value
if (pval <= alpha) {
n_rejections <- n_rejections + 1
}
}
n_rejections / N_tries
[1] 0.2068
n_rejections/N_tries: 0.2068
Conclusion: The type I error rate is now 0.2068, roughly four times the nominal 0.05. This
is caused by the unequal sample sizes: when the smaller sample has the larger variance, the
pooled t-test is anticonservative. The t-test is fairly robust to unequal variances alone, but
not when the sample sizes also differ.
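A quick way to see why the pooled test misbehaves here: when the smaller sample has the larger variance, the pooled standard error badly underestimates the Welch standard error, inflating the t statistics. (With the sample sizes swapped, the inequality reverses and the test becomes conservative, as the next simulation shows.)

```r
# Pooled vs Welch standard errors for n1 = 30, sd1 = 10, n2 = 70, sd2 = 1.
n1 <- 30; s1 <- 10; n2 <- 70; s2 <- 1
sp2 <- ((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2)  # pooled variance
se_pooled <- sqrt(sp2 * (1 / n1 + 1 / n2))
se_welch <- sqrt(s1^2 / n1 + s2^2 / n2)
round(c(pooled = se_pooled, welch = se_welch), 3)
```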
set.seed(42)
alpha <- 0.05
n_rejections <- 0
N_tries <- 10000
n_sample_size1 <- 70
n_sample_size2 <- 30
mean1 <- 10
mean2 <- mean1
sd1 <- 10
sd2 <- 1
for (i in seq(1, N_tries)) {
sample1 <- rnorm(n = n_sample_size1, mean = mean1, sd = sd1)
sample2 <- rnorm(n = n_sample_size2, mean = mean2, sd = sd2)
pval <- t.test(sample1, sample2, var.equal = TRUE)$p.value
if (pval <= alpha) {
n_rejections <- n_rejections + 1
}
}
n_rejections / N_tries
[1] 0.0043
This is a very conservative test: when the larger sample also has the larger variance, the
pooled t-test's type I error rate (here 0.0043) falls far below the nominal 0.05.
set.seed(42)
alpha <- 0.05
n_rejections <- 0
N_tries <- 10000
n_sample_size1 <- 70
n_sample_size2 <- 30
mean1 <- 10
mean2 <- mean1
sd1 <- 10
sd2 <- 1
for (i in seq(1, N_tries)) {
sample1 <- rnorm(n = n_sample_size1, mean = mean1, sd = sd1)
sample2 <- rnorm(n = n_sample_size2, mean = mean2, sd = sd2)
pval <- t.test(sample1, sample2, var.equal = FALSE)$p.value
if (pval <= alpha) {
n_rejections <- n_rejections + 1
}
}
n_rejections / N_tries
[1] 0.0493
n_rejections/N_tries: 0.0493
Conclusion: The type I error rate is back to roughly 0.05, despite the two samples being very
different. The Welch test handles the differences in sample sizes and variances well.
4h. Toss in exponential: Hopefully you have got a feeling for what is happening. Now we
take normality for a toss. I have mentioned multiple times in class that it is a relaxable
assumption - but is it really? [2.5 points]
For example if we simulate an exponential distribution this is what it looks like
For n_sample_size1 = n_sample_size2 = 100 and rate parameters r1=5
and r2=5, use a t-test to tabulate the number of times you reject the null. With 𝛼 = 0.05, how
many times do you reject the null hypothesis that the mean of two samples is equal? What
do you conclude?
set.seed(42)
alpha <- 0.05
n_rejections <- 0
N_tries <- 10000
n_sample_size1 <- 100
n_sample_size2 <- n_sample_size1
rate1 <- 5
rate2 <- 5
# (loop below reconstructed to mirror the earlier chunks, with rexp draws)
for (i in seq(1, N_tries)) {
sample1 <- rexp(n = n_sample_size1, rate = rate1)
sample2 <- rexp(n = n_sample_size2, rate = rate2)
pval <- t.test(sample1, sample2)$p.value
if (pval <= alpha) {
n_rejections <- n_rejections + 1
}
}
n_rejections / N_tries
[1] 0.0495
n_rejections/N_tries: 0.0495
Conclusion: The loss of normality is not a big deal for the t-test here: with n = 100 per group,
the type I error rate is still close to 0.05.
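The robustness comes from the central limit theorem: at n = 100, the sampling distribution of the mean of exponential draws is already close to normal. A small check, with the simulation settings repeated from above:

```r
# Distribution of sample means of Exp(rate = 5) draws with n = 100.
set.seed(1)
means <- replicate(2000, mean(rexp(100, rate = 5)))
round(c(mean = mean(means), sd = sd(means)), 3)
# theoretical values: mean 1/5 = 0.2, sd (1/5)/sqrt(100) = 0.02
```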
4i. Toss in exponential 2: Following 4h, repeat the experiment with n_sample_size1 = n_sample_size2 = 100
and rate parameters r1=5 and r2=10, and use a t-test to tabulate the number of times you reject
the null. With 𝛼 = 0.05, how many times do you reject the null hypothesis that the mean
of two samples is equal? What do you conclude? [5 points]
set.seed(42)
alpha <- 0.05
n_rejections <- 0
N_tries <- 10000
n_sample_size1 <- 100
n_sample_size2 <- n_sample_size1
rate1 <- 5
rate2 <- 10
# (loop below reconstructed to mirror the 4h chunk)
for (i in seq(1, N_tries)) {
sample1 <- rexp(n = n_sample_size1, rate = rate1)
sample2 <- rexp(n = n_sample_size2, rate = rate2)
pval <- t.test(sample1, sample2)$p.value
if (pval <= alpha) {
n_rejections <- n_rejections + 1
}
}
n_rejections / N_tries
[1] 0.9982
Ahh, the rejection rate is now ~1. But note that the two exponential populations genuinely
differ in mean (1/5 = 0.2 vs 1/10 = 0.1), so this is power, not a type I error rate: even with
non-normal data, the t-test readily detects the true difference at these sample sizes.
Problem 05 [30 points]
Standard error: Adults who broke their wrists were tested to see if their grip strength (kg)
decreased over 6 weeks. The data is provided as a tsv here.
5b. Which test would you employ to test your hypothesis and why? [2.5 points]
Paired t-test since the same subjects are measured at two different time points.
No points for any other answer
5c. If there are conditions to be satisfied for applying your test, write R code to output
checks for validity of your test. [5 points]
df <- read_tsv("Assignment02_grip_strength.tsv")
Rows: 20 Columns: 3
-- Column specification --------------------------------------------------------
Delimiter: "\t"
dbl (3): subject, baseline, measured_6weekslater
i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.
qqnorm(df$baseline)
qqline(df$baseline)
For the 6 weeks measurement:
qqnorm(df$measured_6weekslater)
qqline(df$measured_6weekslater)
5d. Perform a test (using R) to test your hypothesis in a) reporting p-value and the
conclusion. [5 points]
Paired t-test
p-value: 1
Conclusion: We fail to reject the null hypothesis based on a directional paired t-test (i.e. under
the given alternative hypothesis). Note that the direction of the effect is exactly opposite
to what we expected, which is why the p-value is 1.
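The test call itself is hidden in the rendered chunk; a sketch with hypothetical numbers (not the assignment data) illustrates why a directional paired test gives a p-value near 1 when the effect runs the other way:

```r
# Hypothetical grip-strength values; NOT the assignment data.
baseline <- c(30, 28, 35, 40, 32)
after6w <- c(32, 29, 36, 41, 33)  # strength went UP, opposite to H_A
# H_A: strength decreased, i.e. the 6-week mean is lower than baseline.
t.test(after6w, baseline, paired = TRUE, alternative = "less")$p.value
# close to 1, since every paired difference is positive
```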
5f. Transform the original variable by a log() transformation and repeat your analysis in
5c and 5d [10 points]
We perform a log transformation by first adding 1 and then taking the log.
df2 <- df
df2$baseline <- log(1 + df$baseline)
df2$measured_6weekslater <- log(1 + df$measured_6weekslater)
qqnorm(df2$baseline)
qqline(df2$baseline)
qqnorm(df2$measured_6weekslater)
qqline(df2$measured_6weekslater)
t.test(x = df2$measured_6weekslater, df2$baseline, paired = T, var.equal = F, alternative = "
Paired t-test
p-value: 1
Question: how does the result of your analysis compare with your conclusion in d?
Conclusion: The t statistic is now smaller, but the overall conclusion is unchanged: we fail
to reject the null hypothesis.
Problem 06 [50 points]
PPT - Principal proteins test: This is a data-heavy question, designed to give you
another exposure to real-world data, which is messy, hard to parse, and often not so well
documented. It is also to show how a random (or not so random) twitter/X thread can be
turned into an assignment problem - which is one of the reasons the assignment was delayed
(the other one being figuring out how to combat ChatGPT usage for directly copy-pasting
answers (not code)). It is also to applaud the government of India for the wealth of data it
collects - you just need to find (and parse) it.
The tsv here has the amino acid content of different food items. The data was programmatically
extracted from this PDF and contains the amino acid profile of various food items - some
familiar and some not so familiar. Very few people (in the world) have looked at
this data in the way you are going to.
Using the dimensionality reduction techniques we studied in class, perform an exploratory
analysis and build a story around your plot. The questions are broad, but there are full points
only for specific responses.
GENERAL RUBRIC: If the plot looks exactly like the PCA plot above give extra +5
points. No penalty for including the additional columns.
6a. The plot. Plot the output of PCA to demonstrate what the lower dimensional
representation of the data looks like. [10 points]
Remember the dataset is messy: if you want to do dimensionality reduction, you want to
retain only numeric columns. While the original tsv also has uncertainty values, to make the
task easier I have generated a cleaned version of the tsv here with the uncertainty values
removed. There are 39 columns in total:
Rows: 612 Columns: 39
-- Column specification --------------------------------------------------------
Delimiter: "\t"
chr (2): food_code, food_name
dbl (37): number_of_regions, Alanine, Arginine, Aspartic Acid, Glutamic Acid...
i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.
df <- df %>%
drop_na() %>%
unique() %>%
as.data.frame()
colnames(df)
Note that the last few columns are the standard deviation columns. You should remove these.
If not, -2.5 points.
pca <- prcomp(df %>% as.matrix(), scale. = T)
my_colors <- c(
"#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd",
"#8c564b", "#e377c2", "#7f7f7f", "#bcbd22", "#17becf",
"#aec7e8", "#ffbb78", "#98df8a", "#ff9896", "#c5b0d5",
"#c49c94", "#f7b6d2", "#c7c7c7", "#dbdb8d", "#9edae5"
)
6b. The story Do you notice any discernible pattern in your data? What are some factor(s)
that explain your PC1 and PC2? [10 points]
Full points for guessing the animal vs plant protein divide.
We can probe this further by splitting the x axis (a bit arbitrarily but from the above plot)
food_code2:  A  B  C  D  E  G  H  J  L  N  O  P  R  S
count:       5  5 10  3  7  2  2  4  3 14 63 89  2 10
What are O and P?
df.master %>%
filter(food_code2 %in% c("O", "P")) %>%
pull(food_name)
[26] "Beef, chops"
[27] "Beef, round (leg)"
[28] "Beef, brain"
[29] "Beef, tongue"
[30] "Beef, lungs"
[31] "Beef, heart"
[32] "Beef, liver"
[33] "Beef, tripe"
[34] "Beef, spleen"
[35] "Beef, kidneys"
[36] "Calf, shoulder"
[37] "Calf, chops"
[38] "Calf, round (leg)"
[39] "Calf, brain"
[40] "Calf, tongue"
[41] "Calf, heart"
[42] "Calf, liver"
[43] "Calf, spleen"
[44] "Calf, kidneys"
[45] "Mithun, shoulder"
[46] "Mithun, chops"
[47] "Mithun, round (leg)"
[48] "Pork, shoulder"
[49] "Pork, chops"
[50] "Pork, ham"
[51] "Pork, lungs"
[52] "Pork, heart"
[53] "Pork, liver"
[54] "Pork, stomach"
[55] "Pork, spleen"
[56] "Pork, kidneys"
[57] "Pork, tube (small intestine)"
[58] "Hare, shoulder"
[59] "Hare, chops"
[60] "Hare, leg"
[61] "Rabbit, shoulder"
[62] "Rabbit, chops"
[63] "Rabbit, leg"
[64] "Allathi (Elops machnata)"
[65] "Aluva (Parastromateus niger)"
[66] "Anchovy (Stolephorus indicus)"
[67] "Ari fish (Aprion virescens)"
[68] "Betki (Lates calcarifer)"
[69] "Black snapper (Macolor niger)"
[70] "Bombay duck (Harpadon nehereus)"
[71] "Bommuralu (Muraenesox cinerius)"
[72] "Cat fish (Tachysurus thalassinus)"
[73] "Chakla (Rachycentron canadum)"
[74] "Chappal (Aluterus monoceros )"
[75] "Chelu (Elagatis bipinnulata)"
[76] "(Lutjanus quinquelineatus) Chembali"
[77] "Eri meen (Pristipomoides filamentosus)"
[78] "Gobro (Epinephelus diacanthus)"
[79] "Guitar fish (Rhinobatus prahli)"
[80] "Hilsa (Tenualosa ilisha)"
[81] "Jallal (Arius sp.)"
[82] "Jathi vela meen (Lethrinus lentjan)"
[83] "Kadal bral (Synodus indicus)"
[84] "Kadali (Nemipterus mesoprion)"
[85] "Kalamaara (Leptomelanosoma indicum)"
[86] "Kalava (Epinephelus coioides)"
[87] "Kanamayya (Lutjanus rivulatus)"
[88] "Kannadi paarai (Alectis indicus)"
[89] "Karimeen (Etroplus suratensis)"
[90] "Karnagawala (Anchoa hepsetus)"
[91] "Kayrai (Thunnus albacores)"
[92] "Kiriyan (Atule mate)"
[93] "Kite fish (Mobula kuhlii)"
[94] "Korka (Terapon jarbua)"
[95] "Kulam paarai (Carangoides fulvoguttatus)"
[96] "Maagaa (Polynemus plebeius)"
[97] "Mackerel (Rastrelliger kanagurta)"
[98] "Manda clathi (Naso reticulatus)"
[99] "Matha (Acanthurus mata)"
[100] "Milk fish (Chanos chanos)"
[101] "Moon fish (Mene maculata)"
[102] "Mullet (Mugil cephalus)"
[103] "Mural (Tylosurus crocodilus)"
[104] "Myil meen (Istiophorus platypterus)"
[105] "Nalla bontha (Epinephelus sp.)"
[106] "Narba (Caranx sexfasciatus)"
[107] "Paarai (Caranx heberi)"
[108] "Padayappa (Canthidermis maculata)"
[109] "Pali kora (Panna microdon)"
[110] "Pambada (Lepturacanthus savala)"
[111] "Pandukopa (Pseudosciaena manchurica)"
[112] "Parava (Lactarius lactarius)"
[113] "Parcus (Psettodes erumei)"
[114] "Parrot fish (Scarus ghobban)"
[115] "Perinkilichai (Pinjalo pinjalo)"
[116] "Phopat (Coryphaena hippurus)"
[117] "Piranha (Pygopritis sp.)"
[118] "Pomfret, snub nose (Trachinotus blochii)"
[119] "Pomfret, white (Pampus argenteus)"
[120] "Pranel (Gerres sp.)"
[121] "Pulli paarai (Gnathanodon speciosus)"
[122] "Queen fish (Scomberoides commersonianus)"
[123] "Raai fish (Lobotes surinamensis)"
[124] "Raai vanthu (Epinephelus chlorostigma)"
[125] "Rani (Pink perch)"
[126] "Ray fish, bow head, spotted (Rhina ancylostoma)"
[127] "Red snapper (Lutjanus argentimaculatus)"
[128] "Red snapper, small (Priacanthus hamrur)"
[129] "Sadaya (Platax orbicularis )"
[130] "Salmon (Salmo salar)"
[131] "Sangada (Nemipterus japanicus)"
[132] "Sankata paarai (Caranx ignobilis)"
[133] "Sardine (Sardinella longiceps)"
[134] "Shark (Carcharhinus sorrah)"
[135] "Shark, hammer head (Sphyrna mokarran)"
[136] "Shark, spotted (Stegostoma fasciatum)"
[137] "Shelavu (Sphyraena jello)"
[138] "Silan (Silonia silondia)"
[139] "Silk fish (Beryx sp.)"
[140] "Silver carp (Hypophthalmichthys molitrix)"
[141] "Sole fish (Cynoglossus arel)"
[142] "Stingray (Dasyatis pastinaca)"
[143] "T arlava (Drepane punctata)"
[144] "Tholam (Plectorhinchus schotaf)"
[145] "Tilapia (Oreochromis niloticus)"
[146] "T una (Euthynnus affinis)"
[147] "T una, striped (Katsuwonus pelamis)"
[148] "Valava (Chirocentrus nudus)"
[149] "Vanjaram (Scomberomorus commerson)"
[150] "Tarlava (Drepane punctata)"
[151] "Tuna (Euthynnus affinis)"
[152] "Tuna, striped (Katsuwonus pelamis)"
df <- read_tsv("Table8_amino_acid_profile_no_uncertainity.tsv")
i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 11 x 2
food_code2 food_name
<chr> <chr>
1 A CEREALS AND MILLETS
2 E FRUITS
3 G CONDIMENTS AND SPICES-DRY
4 J MUSHROOMS
5 K MISCELLANEOUS FOODS
6 L MILK AND MILK PRODUCTS
7 M EGG AND EGG PRODUCTS
8 N POULTRY
9 O ANIMAL MEAT
10 P MARINE FISH
11 S FRESHWATER FISH AND SHELLFISH
So PC1 separates animal protein from plant protein. Among animal protein, it separates fish
from other animal proteins.
6c. The factors. Identify the top 2 factors (features) associated with your PC1
and with PC2. [15 points]
You can either use the loadings (not covered in the class/hands-on) or calculate correlations
(covered in the class/hands-on).
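The two routes agree up to scaling: for a PCA on standardized data, the correlation between variable j and PC k equals the loading times that PC's standard deviation. A self-contained check on toy data (not the assignment tsv):

```r
# cor(x_j, PC_k) = rotation[j, k] * sdev[k] for a scaled PCA.
set.seed(7)
X <- matrix(rnorm(200), ncol = 4)
pca <- prcomp(X, scale. = TRUE)
corr <- t(cor(pca$x, scale(X)))               # variables x PCs
implied <- sweep(pca$rotation, 2, pca$sdev, `*`)
max(abs(corr - implied))                      # ~0 (numerical noise)
```

This is why ranking features by loading magnitude and by correlation magnitude gives the same ordering within a PC.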
# A tibble: 324 x 3
aminoacid PC loading
<chr> <chr> <dbl>
1 Methionine PC1 -0.335
2 Lysine PC1 -0.320
3 Tyrosine PC1 -0.314
4 Leucine PC1 -0.298
5 Threonine PC1 -0.296
6 Isoleucine PC1 -0.292
7 Histidine PC1 -0.264
8 Valine PC1 -0.260
9 Phenylalanine PC1 -0.243
10 Glutamic Acid PC1 0.231
# i 314 more rows
# A tibble: 18 x 3
aminoacid PC loading
<chr> <chr> <dbl>
1 Methionine PC1 -0.335
2 Lysine PC1 -0.320
3 Tyrosine PC1 -0.314
4 Leucine PC1 -0.298
5 Threonine PC1 -0.296
6 Isoleucine PC1 -0.292
7 Histidine PC1 -0.264
8 Valine PC1 -0.260
9 Phenylalanine PC1 -0.243
10 Glutamic Acid PC1 0.231
11 Aspartic Acid PC1 0.231
12 Tryptophan PC1 -0.216
13 Glycine PC1 -0.212
14 Cystine PC1 -0.140
15 Arginine PC1 -0.0661
16 Alanine PC1 0.0556
17 Proline PC1 0.0550
18 Serine PC1 -0.0137
# A tibble: 18 x 3
aminoacid PC loading
<chr> <chr> <dbl>
1 Cystine PC2 -0.541
2 Alanine PC2 0.330
3 Proline PC2 -0.316
4 Threonine PC2 0.267
5 Valine PC2 0.266
6 Isoleucine PC2 0.252
7 Leucine PC2 0.239
8 Methionine PC2 -0.237
9 Glycine PC2 -0.230
10 Aspartic Acid PC2 0.217
11 Histidine PC2 -0.191
12 Glutamic Acid PC2 0.107
13 Phenylalanine PC2 0.0974
14 Tyrosine PC2 0.0927
15 Serine PC2 0.0887
16 Lysine PC2 -0.0669
17 Tryptophan PC2 0.0305
18 Arginine PC2 -0.0292
corrs <- cor(pca$x, df.orig %>% as.matrix()) %>% as.data.frame()
corrs.df <- as.data.frame(corrs)
corrs.df$PC <- rownames(corrs.df)
corrs.df <- reshape2::melt(corrs.df)
Using PC as id variables
corrs.df.pc1
PC variable value
1 PC1 Methionine -0.8162151
2 PC1 Lysine -0.7805996
3 PC1 Tyrosine -0.7670752
4 PC1 Leucine -0.7261129
5 PC1 Threonine -0.7219619
6 PC1 Isoleucine -0.7132331
7 PC1 Histidine -0.6439290
8 PC1 Valine -0.6338435
9 PC1 Phenylalanine -0.5921456
10 PC1 Glutamic Acid 0.5644112
11 PC1 Aspartic Acid 0.5643827
12 PC1 Tryptophan -0.5279909
13 PC1 Glycine -0.5180004
14 PC1 Cystine -0.3414679
15 PC1 Arginine -0.1613451
16 PC1 Alanine 0.1357237
17 PC1 Proline 0.1342907
18 PC1 Serine -0.0334551
corrs.df.pc2
PC variable value
1 PC2 Cystine -0.78341155
2 PC2 Alanine 0.47760056
3 PC2 Proline -0.45752737
4 PC2 Threonine 0.38647373
5 PC2 Valine 0.38601124
6 PC2 Isoleucine 0.36442932
7 PC2 Leucine 0.34637683
8 PC2 Methionine -0.34354332
9 PC2 Glycine -0.33386962
10 PC2 Aspartic Acid 0.31399654
11 PC2 Histidine -0.27617274
12 PC2 Glutamic Acid 0.15513500
13 PC2 Phenylalanine 0.14103113
14 PC2 Tyrosine 0.13432469
15 PC2 Serine 0.12842080
16 PC2 Lysine -0.09686546
17 PC2 Tryptophan 0.04415382
18 PC2 Arginine -0.04224612
Methionine is the top factor for PC1, and cystine for PC2. This makes sense, as methionine is
roughly twice(?) as abundant in animal protein as in plant protein.
6d. Is it statistically significant? Based on the topmost factor that you identified in 6c
for PC1, perform a statistical test to test if this factor is statistically different between
the two groups you identified in c. To define the two groups you can make use of the
food_code column (HINT: Individual food codes are not so useful but categories
probably are) [15 points]
We reject the null since the p-value < 1e-16: the methionine content is not the same between
animal and plant proteins.
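The test itself is hidden in the rendered chunk; presumably it is a two-sample comparison of methionine between the animal and plant categories from 6b. A sketch with hypothetical stand-in values (the real test uses the df.master groups, not these numbers):

```r
# Hypothetical methionine values; NOT the assignment data.
methionine_animal <- c(2.1, 2.4, 1.9, 2.6, 2.2)
methionine_plant <- c(1.0, 1.2, 0.9, 1.1, 1.3)
# Welch two-sample t-test of the PC1 top factor between the two groups.
t.test(methionine_animal, methionine_plant)$p.value
```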