DH302 Spring2025 Assignment02-Solutions
Solutions
Instructions
Submit your solutions via Gradescope by 11:59 PM (IST) on Tuesday, 18th February 2025. In-person submissions will not be entertained. Please upload a single PDF file. Late submissions
are allowed with a 10% per day penalty (only up to 22nd February). You can raise your
questions related to the assignment on Piazza - please tag these as assignment_02.
• For theory questions, you can either write your response in LaTeX or put a
screenshot/picture of your handwritten response in the appropriate section. To
embed scanned images, use this format: 
where /path/to/question1.png is the local path (on your laptop) to your scanned
(handwritten) response to question1.
• If you are writing the solutions for theory questions by hand, please use a pen. Pencil
submissions are difficult to read when scanned. You will have to scan each
such answer and embed it in this document.
• Your final submission has to be the PDF that comes from this template - one single pdf.
No Exceptions.
• Please mention the name(s) of people you collaborated with and what exactly you dis-
cussed.
Making your submission: The raw template is available here and an up-to-date version of
this PDF is available here. Open the template in RStudio (you will need to ensure Quarto is
installed). Once you are done with your answers, use the “render” (arrow-like) button on the
toolbar to create a PDF. Only PDF submissions are allowed.
Problem 01 [25 points]
Quality of life: Improvement in quality of life was measured for a group of heart disease
patients after 8 weeks in an exercise program. This experimental group was compared to a
control group who were not in the exercise program. Quality of life was measured using a
21-item questionnaire, with scores of 1–5 on each item. The improvement data are as follows
and are plotted below.
1b. What conclusion could you draw from the dotplot? [2.5 points]
Quality of life scores for the exercise group are higher than those for the control group.
1c. Here is computer output for a t test. Explain what the P-value means in the context
of this study. [5 points]
t = 2.505, df = 33.23, p-value = 0.00866
alternative hypothesis: true difference in means is greater than 0
The p-value is the probability of observing a test statistic at least as extreme as the one seen
(2.505 or higher) if the null hypothesis is true. In this case, the p-value is 0.00866, which is
less than 0.05, so we reject the null. The alternative is that the true difference in means is
greater than 0, i.e. that quality of life improves more in the exercise group.
Full points only if the direction of the statistic is discussed and the conclusion is stated in the
context of the study.
1d. If type-1 error 𝛼 = 0.01, what is your conclusion regarding 𝐻0 ? State your conclusion
in the specific context of this problem. [5 points]
We reject H0 since the p-value (0.00866) is smaller than 𝛼 = 0.01.
1e. The computer output in part (c) is for the directional test. What is the P-value for
the nondirectional test? [5 points]
The p-value for the nondirectional test is 0.00866*2 = 0.01732
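This doubling can be checked directly from the t statistic and degrees of freedom reported in 1c:

```r
# One-sided p-value recomputed from the reported t statistic and df,
# then doubled for the nondirectional (two-sided) test.
p_one_sided <- pt(2.505, df = 33.23, lower.tail = FALSE)
p_two_sided <- 2 * p_one_sided
round(c(one_sided = p_one_sided, two_sided = p_two_sided), 5)
```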
1f. If the test were nondirectional, and 𝛼= 0.01, what conclusions would we make? [5
points]
We would fail to reject the null hypothesis, since the nondirectional p-value (0.01732) exceeds 𝛼 = 0.01.
Normality goes for a toss: Researchers took skin samples from 10 patients who had breast
implants and from a control group of 6 patients. They recorded the level of interleukin-6 or
IL6 in picogram/ml/10 g of tissue, a measure of tissue inflammation, after each tissue sample
was cultured for 24 hours. The dataset is available below (in R)
il6.breast.implant.patients <- c(
231, 308287, 33291, 124550, 17075,
22955, 95102, 5649, 840585, 58924
)
il6.control.patients <- c(35324, 12457, 8276, 44, 278, 840)
df.breast <- data.frame(value = il6.breast.implant.patients)
df.contorl <- data.frame(value = il6.control.patients)
library(ggridges)
library(patchwork)
library(tidyverse)
theme_set(ggpubr::theme_pubr())
df <- bind_rows(list(
`Breast implant` = df.breast,
`Control` = df.contorl
), .id = "group")
p1 | p2 | p3
2b. Draw a Q-Q plot for both the measurements [5 points]
You can use the geom_qq function
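The hidden chunk presumably builds implant.qqplot and contro.qqplot with geom_qq (combined below with patchwork). A self-contained base-R sketch of the same Q-Q diagnostic, with the data repeated from above, would be:

```r
# Base-R Q-Q plots for both groups; data as defined earlier in the problem.
il6.breast.implant.patients <- c(
  231, 308287, 33291, 124550, 17075,
  22955, 95102, 5649, 840585, 58924
)
il6.control.patients <- c(35324, 12457, 8276, 44, 278, 840)
par(mfrow = c(1, 2))
qqnorm(il6.breast.implant.patients, main = "Breast implant")
qqline(il6.breast.implant.patients)
qqnorm(il6.control.patients, main = "Control")
qqline(il6.control.patients)
```

The strong curvature away from the reference line in both panels shows the data are far from normal.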
# !! DO NOT EDIT/REMOVE !!
library(patchwork)
implant.qqplot | contro.qqplot
Sneak peek: I perform a t test of the null hypothesis that two means are equal. I decided to
calculate the means and then choose an alternative hypothesis 𝐻𝐴 𝜇1 > 𝜇2 because I observed
𝑦1̄ > 𝑦2̄ .
3a. Explain what is wrong (if anything) with this procedure and why it is wrong (if
anything). [5 points]
The null and alternative hypotheses should be decided before the data are collected. Choosing
the alternative hypothesis after looking at the data is data ‘snooping’ and biases the results
(a form of p-hacking).
3b. Suppose I reported t = 1.97 on 25 degrees of freedom and a P-value of 0.03. What
is the proper P-value? [5 points]
The proper p-value is 2 × 0.03 = 0.06. Because the direction of the alternative was chosen
after seeing the data, the reported one-sided p-value overstates the evidence; the honest
procedure is the nondirectional test, which doubles the p-value.
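The one- and two-sided p-values for t = 1.97 on 25 degrees of freedom can be checked in R:

```r
# One-sided and two-sided p-values for the reported statistic.
p_one <- pt(1.97, df = 25, lower.tail = FALSE)
p_two <- 2 * p_one
round(c(one_sided = p_one, two_sided = p_two), 3)
```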
4a. The defaults. Simulate two normal samples (n=200) with each mean = 10 and sd =
1, 10000 times. Apply a t-test for each iteration and calculate the p-value. With
𝛼 = 0.05, how many times do you reject the null hypothesis that the mean of two
samples is equal? What do you conclude? [2.5 points]
# !! DO NOT EDIT/REMOVE !!
# OUTPUT: n_rejects -- Number of rejections for alpha=0.05
set.seed(42)
alpha <- 0.05
n_rejections <- 0
N_tries <- 10000
n_sample_size <- 200
# (loop below reconstructed to mirror the 4b chunk; the rendered PDF omitted it)
for (i in seq(1, N_tries)) {
sample1 <- rnorm(n = n_sample_size, mean = 10, sd = 1)
sample2 <- rnorm(n = n_sample_size, mean = 10, sd = 1)
pval <- t.test(sample1, sample2)$p.value
if (pval <= alpha) {
n_rejections <- n_rejections + 1
}
}
n_rejections / N_tries
[1] 0.0511
n_rejections/N_tries: 0.0511
Conclusion: We expect a type I error rate of 0.05, which is what we observe: the null hypothesis
is rejected about 5% of the time even though it is true.
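As a sanity check on the simulation itself: with 10000 tries at a true level of 0.05, the Monte Carlo standard error of the estimated rejection rate is about 0.002, so 0.0511 is well within noise of the nominal 0.05.

```r
# Monte Carlo standard error of the estimated type I error rate
# with 10000 replicates at a true level of 0.05.
sqrt(0.05 * 0.95 / 10000)
```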
4b. Unequal variance. Simulate two normal samples (n=100) with each mean = 10 and
sd1 = 1 and sd2=2.5, 10000 times. Apply a t-test for each iteration and calculate the
p-value. With 𝛼 = 0.05, how many times do you reject the null hypothesis that the mean
of two samples is equal? What do you conclude and is that unusual? Use the default
t.test [2.5 points]
set.seed(42)
alpha <- 0.05
n_rejections <- 0
N_tries <- 10000
n_sample_size <- 200
mean1 <- 10
mean2 <- mean1
sd1 <- 1
sd2 <- 2.5
for (i in seq(1, N_tries)) {
sample1 <- rnorm(n = n_sample_size, mean = mean1, sd = sd1)
sample2 <- rnorm(n = n_sample_size, mean = mean2, sd = sd2)
pval <- t.test(sample1, sample2)$p.value
if (pval <= alpha) {
n_rejections <- n_rejections + 1
}
}
n_rejections / N_tries
[1] 0.0519
n_rejections/N_tries: 0.0519
Conclusion: Even though the equal variance assumption is violated, the type I error rate
is still close to 0.05. One reason is that t.test() by default uses
var.equal=FALSE (the Welch test).
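This default can be verified directly from the function's formal arguments:

```r
# t.test() dispatches to stats:::t.test.default; its var.equal
# argument defaults to FALSE, i.e. the Welch (unequal-variance) test.
formals(stats:::t.test.default)$var.equal
```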
4c. Unequal variance revisited: Repeat the example in 4b, now forcing the pooled test
with t.test(var.equal=TRUE) instead of the default var.equal=FALSE. With 𝛼 = 0.05, how
many times do you reject the null hypothesis that the mean of two samples is equal?
What do you conclude? [2.5 points]
set.seed(42)
alpha <- 0.05
n_rejections <- 0
N_tries <- 10000
n_sample_size <- 200
mean1 <- 10
mean2 <- mean1
sd1 <- 1
sd2 <- 2.5
for (i in seq(1, N_tries)) {
sample1 <- rnorm(n = n_sample_size, mean = mean1, sd = sd1)
sample2 <- rnorm(n = n_sample_size, mean = mean2, sd = sd2)
pval <- t.test(sample1, sample2, var.equal = TRUE)$p.value
if (pval <= alpha) {
n_rejections <- n_rejections + 1
}
}
n_rejections / N_tries
[1] 0.0523
n_rejections/N_tries: 0.0523
Conclusion: The type I error rate is still close to 0.05 despite forcing the equal variance
assumption. This is because the two sample sizes are equal and large, which makes the pooled
t-test robust to unequal variances.
4d. Severe violation: Following 4c, now set sd1=10, sd2=1, and use a non-Welch t-test to
tabulate the number of times you reject the null. With 𝛼 = 0.05, how many times do you
reject the null hypothesis that the mean of two samples is equal? What do you conclude?
[2.5 points]
# OUTPUT: n_rejects -- Number of rejections for alpha=0.05
set.seed(42)
alpha <- 0.05
n_rejections <- 0
N_tries <- 10000
n_sample_size <- 200
mean1 <- 10
mean2 <- mean1
sd1 <- 10
sd2 <- 1
for (i in seq(1, N_tries)) {
sample1 <- rnorm(n = n_sample_size, mean = mean1, sd = sd1)
sample2 <- rnorm(n = n_sample_size, mean = mean2, sd = sd2)
pval <- t.test(sample1, sample2, var.equal = TRUE)$p.value
if (pval <= alpha) {
n_rejections <- n_rejections + 1
}
}
n_rejections / N_tries
[1] 0.0527
n_rejections/N_tries: 0.0527
Conclusion: The type I error rate is still close to 0.05 despite forcing the equal variance
assumption with a much larger variance gap. With equal sample sizes the pooled and Welch
statistics nearly coincide, so the test remains robust.
4e. Severe violation 2: Following 4d, now simulate different sample sizes with
n_sample_size1=30 and n_sample_size2=70, sd1=10, sd2=1, and use a non-Welch
t-test to tabulate the number of times you reject the null. With 𝛼 = 0.05, how many
times do you reject the null hypothesis that the mean of two samples is equal? What do
you conclude? [2.5 points]
set.seed(42)
alpha <- 0.05
n_rejections <- 0
N_tries <- 10000
n_sample_size1 <- 30
n_sample_size2 <- 70
mean1 <- 10
mean2 <- mean1
sd1 <- 10
sd2 <- 1
for (i in seq(1, N_tries)) {
sample1 <- rnorm(n = n_sample_size1, mean = mean1, sd = sd1)
sample2 <- rnorm(n = n_sample_size2, mean = mean2, sd = sd2)
pval <- t.test(sample1, sample2, var.equal = TRUE)$p.value
if (pval <= alpha) {
n_rejections <- n_rejections + 1
}
}
n_rejections / N_tries
[1] 0.2068
n_rejections/N_tries: 0.2068
Conclusion: The type I error rate is now 0.2068, roughly four times the nominal 0.05. This
is caused by the unequal sample sizes: when the smaller sample has the larger variance, the
pooled t-test is anticonservative. The t-test is fairly robust to unequal variances alone, but
not when the sample sizes also differ.
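A quick way to see why the pooled test misbehaves here: when the smaller sample has the larger variance, the pooled standard error badly underestimates the Welch standard error, inflating the t statistics. (With the sample sizes swapped, the inequality reverses and the test becomes conservative, as the next simulation shows.)

```r
# Pooled vs Welch standard errors for n1 = 30, sd1 = 10, n2 = 70, sd2 = 1.
n1 <- 30; s1 <- 10; n2 <- 70; s2 <- 1
sp2 <- ((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2)  # pooled variance
se_pooled <- sqrt(sp2 * (1 / n1 + 1 / n2))
se_welch <- sqrt(s1^2 / n1 + s2^2 / n2)
round(c(pooled = se_pooled, welch = se_welch), 3)
```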
set.seed(42)
alpha <- 0.05
n_rejections <- 0
N_tries <- 10000
n_sample_size1 <- 70
n_sample_size2 <- 30
mean1 <- 10
mean2 <- mean1
sd1 <- 10
sd2 <- 1
for (i in seq(1, N_tries)) {
sample1 <- rnorm(n = n_sample_size1, mean = mean1, sd = sd1)
sample2 <- rnorm(n = n_sample_size2, mean = mean2, sd = sd2)
pval <- t.test(sample1, sample2, var.equal = TRUE)$p.value
if (pval <= alpha) {
n_rejections <- n_rejections + 1
}
}
n_rejections / N_tries
[1] 0.0043
This is a very conservative test: when the larger sample also has the larger variance, the
pooled t-test's type I error rate (here 0.0043) falls far below the nominal 0.05.
set.seed(42)
alpha <- 0.05
n_rejections <- 0
N_tries <- 10000
n_sample_size1 <- 70
n_sample_size2 <- 30
mean1 <- 10
mean2 <- mean1
sd1 <- 10
sd2 <- 1
for (i in seq(1, N_tries)) {
sample1 <- rnorm(n = n_sample_size1, mean = mean1, sd = sd1)
sample2 <- rnorm(n = n_sample_size2, mean = mean2, sd = sd2)
pval <- t.test(sample1, sample2, var.equal = FALSE)$p.value
if (pval <= alpha) {
n_rejections <- n_rejections + 1
}
}
n_rejections / N_tries
[1] 0.0493
n_rejections/N_tries: 0.0493
Conclusion: The type I error rate is back to roughly 0.05, despite the two samples being very
different. The Welch test handles the differences in sample sizes and variances well.
4h. Toss in exponential: Hopefully you have got a feeling for what is happening. Now we
take normality for a toss. I have mentioned multiple times in class that it is a relaxable
assumption - but is it really? [2.5 points]
For example if we simulate an exponential distribution this is what it looks like
For n_sample_size1 = n_sample_size2 = 100 and rate parameters r1=5
and r2=5, use a t-test to tabulate the number of times you reject the null. With 𝛼 = 0.05, how
many times do you reject the null hypothesis that the mean of two samples is equal? What
do you conclude?
set.seed(42)
alpha <- 0.05
n_rejections <- 0
N_tries <- 10000
n_sample_size1 <- 100
n_sample_size2 <- n_sample_size1
rate1 <- 5
rate2 <- 5
# (loop below reconstructed to mirror the earlier chunks, with rexp draws)
for (i in seq(1, N_tries)) {
sample1 <- rexp(n = n_sample_size1, rate = rate1)
sample2 <- rexp(n = n_sample_size2, rate = rate2)
pval <- t.test(sample1, sample2)$p.value
if (pval <= alpha) {
n_rejections <- n_rejections + 1
}
}
n_rejections / N_tries
[1] 0.0495
n_rejections/N_tries: 0.0495
Conclusion: The loss of normality is not a big deal for the t-test here: with n = 100 per group,
the type I error rate is still close to 0.05.
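The robustness comes from the central limit theorem: at n = 100, the sampling distribution of the mean of exponential draws is already close to normal. A small check, with the simulation settings repeated from above:

```r
# Distribution of sample means of Exp(rate = 5) draws with n = 100.
set.seed(1)
means <- replicate(2000, mean(rexp(100, rate = 5)))
round(c(mean = mean(means), sd = sd(means)), 3)
# theoretical values: mean 1/5 = 0.2, sd (1/5)/sqrt(100) = 0.02
```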
4i. Toss in exponential 2: Following 4h, repeat the experiment with n_sample_size1 = n_sample_size2 = 100
and rate parameters r1=5 and r2=10, and use a t-test to tabulate the number of times you reject
the null. With 𝛼 = 0.05, how many times do you reject the null hypothesis that the mean
of two samples is equal? What do you conclude? [5 points]
set.seed(42)
alpha <- 0.05
n_rejections <- 0
N_tries <- 10000
n_sample_size1 <- 100
n_sample_size2 <- n_sample_size1
rate1 <- 5
rate2 <- 10
# (loop below reconstructed to mirror the 4h chunk)
for (i in seq(1, N_tries)) {
sample1 <- rexp(n = n_sample_size1, rate = rate1)
sample2 <- rexp(n = n_sample_size2, rate = rate2)
pval <- t.test(sample1, sample2)$p.value
if (pval <= alpha) {
n_rejections <- n_rejections + 1
}
}
n_rejections / N_tries
[1] 0.9982
Ahh, the rejection rate is now ~1. But note that the two exponential populations genuinely
differ in mean (1/5 = 0.2 vs 1/10 = 0.1), so this is power, not a type I error rate: even with
non-normal data, the t-test readily detects the true difference at these sample sizes.
Problem 05 [30 points]
Standard error: Adults who broke their wrists were tested to see if their grip strength (kg)
decreased over 6 weeks. The data is provided as a tsv here.
5b. Which test would you employ to test your hypothesis and why? [2.5 points]
Paired t-test since the same subjects are measured at two different time points.
No points for any other answer
5c. If there are conditions to be satisfied for applying your test, write R code to output
checks for validity of your test. [5 points]
df <- read_tsv("Assignment02_grip_strength.tsv")
Rows: 20 Columns: 3
-- Column specification --------------------------------------------------------
Delimiter: "\t"
dbl (3): subject, baseline, measured_6weekslater
i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.
qqnorm(df$baseline)
qqline(df$baseline)
For the 6 weeks measurement:
qqnorm(df$measured_6weekslater)
qqline(df$measured_6weekslater)
5d. Perform a test (using R) to test your hypothesis in a) reporting p-value and the
conclusion. [5 points]
Paired t-test
p-value: 1
Conclusion: We fail to reject the null hypothesis based on a directional paired t-test (i.e. under
the given alternative hypothesis). Note that the direction of the effect is exactly opposite
to what we expected, which is why the p-value is 1.
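The test call itself is hidden in the rendered chunk; a sketch with hypothetical numbers (not the assignment data) illustrates why a directional paired test gives a p-value near 1 when the effect runs the other way:

```r
# Hypothetical grip-strength values; NOT the assignment data.
baseline <- c(30, 28, 35, 40, 32)
after6w <- c(32, 29, 36, 41, 33)  # strength went UP, opposite to H_A
# H_A: strength decreased, i.e. the 6-week mean is lower than baseline.
t.test(after6w, baseline, paired = TRUE, alternative = "less")$p.value
# close to 1, since every paired difference is positive
```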
5f. Transform the original variable by a log() transformation and repeat your analysis in
5c and 5d [10 points]
We perform a log transformation by first adding 1 and then taking the log.
df2 <- df
df2$baseline <- log(1 + df$baseline)
df2$measured_6weekslater <- log(1 + df$measured_6weekslater)
qqnorm(df2$baseline)
qqline(df2$baseline)
qqnorm(df2$measured_6weekslater)
qqline(df2$measured_6weekslater)
t.test(x = df2$measured_6weekslater, df2$baseline, paired = T, var.equal = F, alternative = "
Paired t-test
p-value: 1
Question: how does the result of your analysis compare with your conclusion in d?
Conclusion: The t statistic is now smaller, but the overall conclusion is unchanged: we fail
to reject the null hypothesis.
Problem 06 [50 points]
PPT - Principal proteins test: This is a data-heavy question, designed to give you
another exposure to real-world data, which is messy, hard to parse, and often not so well
documented. It is also to show how a random (or not so random) twitter/X thread can be
turned into an assignment problem - which is one of the reasons the assignment was delayed
(the other one being figuring out how to combat ChatGPT usage for directly copy-pasting
answers (not code)). It is also to applaud the government of India for the wealth of data it
collects - you just need to find (and parse) it.
The tsv here has the amino acid content of different food items. The data was programmatically
extracted from this PDF and contains the amino acid profile of various food items - some
familiar and some not so familiar. Very few people (in the world) have looked at
this data in the way you are going to.
Using the dimensionality reduction techniques we studied in class, perform an exploratory
analysis and build a story around your plot. The questions are broad, but there are full points
only for specific responses.
GENERAL RUBRIC: If the plot looks exactly like the PCA plot above give extra +5
points. No penalty for including the additional columns.
6a. The plot. Plot the output of PCA to demonstrate what the lower dimensional
representation of the data looks like. [10 points]
Remember the dataset is messy: if you want to do dimensionality reduction, you want to
retain only numeric columns. While the original tsv also has uncertainty values, to make the
task easier I have generated a cleaned version of the tsv here with the uncertainty values
removed. There are 39 columns in total:
Rows: 612 Columns: 39
-- Column specification --------------------------------------------------------
Delimiter: "\t"
chr (2): food_code, food_name
dbl (37): number_of_regions, Alanine, Arginine, Aspartic Acid, Glutamic Acid...
i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.
df <- df %>%
drop_na() %>%
unique() %>%
as.data.frame()
colnames(df)
Note that the last few columns are the standard deviation columns. You should remove these.
If not, -2.5 points.
pca <- prcomp(df %>% as.matrix(), scale. = T)
my_colors <- c(
"#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd",
"#8c564b", "#e377c2", "#7f7f7f", "#bcbd22", "#17becf",
"#aec7e8", "#ffbb78", "#98df8a", "#ff9896", "#c5b0d5",
"#c49c94", "#f7b6d2", "#c7c7c7", "#dbdb8d", "#9edae5"
)
6b. The story Do you notice any discernible pattern in your data? What are some factor(s)
that explain your PC1 and PC2? [10 points]
Full points for guessing the animal vs plant protein divide.
We can probe this further by splitting the x axis (a bit arbitrarily but from the above plot)
food_code2:  A  B  C  D  E  G  H  J  L  N  O  P  R  S
count:       5  5 10  3  7  2  2  4  3 14 63 89  2 10
What are O and P?
df.master %>%
filter(food_code2 %in% c("O", "P")) %>%
pull(food_name)
[26] "Beef, chops"
[27] "Beef, round (leg)"
[28] "Beef, brain"
[29] "Beef, tongue"
[30] "Beef, lungs"
[31] "Beef, heart"
[32] "Beef, liver"
[33] "Beef, tripe"
[34] "Beef, spleen"
[35] "Beef, kidneys"
[36] "Calf, shoulder"
[37] "Calf, chops"
[38] "Calf, round (leg)"
[39] "Calf, brain"
[40] "Calf, tongue"
[41] "Calf, heart"
[42] "Calf, liver"
[43] "Calf, spleen"
[44] "Calf, kidneys"
[45] "Mithun, shoulder"
[46] "Mithun, chops"
[47] "Mithun, round (leg)"
[48] "Pork, shoulder"
[49] "Pork, chops"
[50] "Pork, ham"
[51] "Pork, lungs"
[52] "Pork, heart"
[53] "Pork, liver"
[54] "Pork, stomach"
[55] "Pork, spleen"
[56] "Pork, kidneys"
[57] "Pork, tube (small intestine)"
[58] "Hare, shoulder"
[59] "Hare, chops"
[60] "Hare, leg"
[61] "Rabbit, shoulder"
[62] "Rabbit, chops"
[63] "Rabbit, leg"
[64] "Allathi (Elops machnata)"
[65] "Aluva (Parastromateus niger)"
[66] "Anchovy (Stolephorus indicus)"
[67] "Ari fish (Aprion virescens)"
[68] "Betki (Lates calcarifer)"
[69] "Black snapper (Macolor niger)"
[70] "Bombay duck (Harpadon nehereus)"
[71] "Bommuralu (Muraenesox cinerius)"
[72] "Cat fish (Tachysurus thalassinus)"
[73] "Chakla (Rachycentron canadum)"
[74] "Chappal (Aluterus monoceros )"
[75] "Chelu (Elagatis bipinnulata)"
[76] "(Lutjanus quinquelineatus) Chembali"
[77] "Eri meen (Pristipomoides filamentosus)"
[78] "Gobro (Epinephelus diacanthus)"
[79] "Guitar fish (Rhinobatus prahli)"
[80] "Hilsa (Tenualosa ilisha)"
[81] "Jallal (Arius sp.)"
[82] "Jathi vela meen (Lethrinus lentjan)"
[83] "Kadal bral (Synodus indicus)"
[84] "Kadali (Nemipterus mesoprion)"
[85] "Kalamaara (Leptomelanosoma indicum)"
[86] "Kalava (Epinephelus coioides)"
[87] "Kanamayya (Lutjanus rivulatus)"
[88] "Kannadi paarai (Alectis indicus)"
[89] "Karimeen (Etroplus suratensis)"
[90] "Karnagawala (Anchoa hepsetus)"
[91] "Kayrai (Thunnus albacores)"
[92] "Kiriyan (Atule mate)"
[93] "Kite fish (Mobula kuhlii)"
[94] "Korka (Terapon jarbua)"
[95] "Kulam paarai (Carangoides fulvoguttatus)"
[96] "Maagaa (Polynemus plebeius)"
[97] "Mackerel (Rastrelliger kanagurta)"
[98] "Manda clathi (Naso reticulatus)"
[99] "Matha (Acanthurus mata)"
[100] "Milk fish (Chanos chanos)"
[101] "Moon fish (Mene maculata)"
[102] "Mullet (Mugil cephalus)"
[103] "Mural (Tylosurus crocodilus)"
[104] "Myil meen (Istiophorus platypterus)"
[105] "Nalla bontha (Epinephelus sp.)"
[106] "Narba (Caranx sexfasciatus)"
[107] "Paarai (Caranx heberi)"
[108] "Padayappa (Canthidermis maculata)"
[109] "Pali kora (Panna microdon)"
[110] "Pambada (Lepturacanthus savala)"
[111] "Pandukopa (Pseudosciaena manchurica)"
[112] "Parava (Lactarius lactarius)"
[113] "Parcus (Psettodes erumei)"
[114] "Parrot fish (Scarus ghobban)"
[115] "Perinkilichai (Pinjalo pinjalo)"
[116] "Phopat (Coryphaena hippurus)"
[117] "Piranha (Pygopritis sp.)"
[118] "Pomfret, snub nose (Trachinotus blochii)"
[119] "Pomfret, white (Pampus argenteus)"
[120] "Pranel (Gerres sp.)"
[121] "Pulli paarai (Gnathanodon speciosus)"
[122] "Queen fish (Scomberoides commersonianus)"
[123] "Raai fish (Lobotes surinamensis)"
[124] "Raai vanthu (Epinephelus chlorostigma)"
[125] "Rani (Pink perch)"
[126] "Ray fish, bow head, spotted (Rhina ancylostoma)"
[127] "Red snapper (Lutjanus argentimaculatus)"
[128] "Red snapper, small (Priacanthus hamrur)"
[129] "Sadaya (Platax orbicularis )"
[130] "Salmon (Salmo salar)"
[131] "Sangada (Nemipterus japanicus)"
[132] "Sankata paarai (Caranx ignobilis)"
[133] "Sardine (Sardinella longiceps)"
[134] "Shark (Carcharhinus sorrah)"
[135] "Shark, hammer head (Sphyrna mokarran)"
[136] "Shark, spotted (Stegostoma fasciatum)"
[137] "Shelavu (Sphyraena jello)"
[138] "Silan (Silonia silondia)"
[139] "Silk fish (Beryx sp.)"
[140] "Silver carp (Hypophthalmichthys molitrix)"
[141] "Sole fish (Cynoglossus arel)"
[142] "Stingray (Dasyatis pastinaca)"
[143] "T arlava (Drepane punctata)"
[144] "Tholam (Plectorhinchus schotaf)"
[145] "Tilapia (Oreochromis niloticus)"
[146] "T una (Euthynnus affinis)"
[147] "T una, striped (Katsuwonus pelamis)"
[148] "Valava (Chirocentrus nudus)"
[149] "Vanjaram (Scomberomorus commerson)"
[150] "Tarlava (Drepane punctata)"
[151] "Tuna (Euthynnus affinis)"
[152] "Tuna, striped (Katsuwonus pelamis)"
df <- read_tsv("Table8_amino_acid_profile_no_uncertainity.tsv")
i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 11 x 2
food_code2 food_name
<chr> <chr>
1 A CEREALS AND MILLETS
2 E FRUITS
3 G CONDIMENTS AND SPICES-DRY
4 J MUSHROOMS
5 K MISCELLANEOUS FOODS
6 L MILK AND MILK PRODUCTS
7 M EGG AND EGG PRODUCTS
8 N POULTRY
9 O ANIMAL MEAT
10 P MARINE FISH
11 S FRESHWATER FISH AND SHELLFISH
So PC1 separates animal protein from plant protein. Among animal protein, it separates fish
from other animal proteins.
6c. The factors. Identify the top 2 factors (features) associated with your PC1
and with PC2. [15 points]
You can either use the loadings (not covered in the class/hands-on) or calculate correlations
(covered in the class/hands-on).
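The two routes agree up to scaling: for a PCA on standardized data, the correlation between variable j and PC k equals the loading times that PC's standard deviation. A self-contained check on toy data (not the assignment tsv):

```r
# cor(x_j, PC_k) = rotation[j, k] * sdev[k] for a scaled PCA.
set.seed(7)
X <- matrix(rnorm(200), ncol = 4)
pca <- prcomp(X, scale. = TRUE)
corr <- t(cor(pca$x, scale(X)))               # variables x PCs
implied <- sweep(pca$rotation, 2, pca$sdev, `*`)
max(abs(corr - implied))                      # ~0 (numerical noise)
```

This is why ranking features by loading magnitude and by correlation magnitude gives the same ordering within a PC.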
# A tibble: 324 x 3
aminoacid PC loading
<chr> <chr> <dbl>
1 Methionine PC1 -0.335
2 Lysine PC1 -0.320
3 Tyrosine PC1 -0.314
4 Leucine PC1 -0.298
5 Threonine PC1 -0.296
6 Isoleucine PC1 -0.292
7 Histidine PC1 -0.264
8 Valine PC1 -0.260
9 Phenylalanine PC1 -0.243
10 Glutamic Acid PC1 0.231
# i 314 more rows
# A tibble: 18 x 3
aminoacid PC loading
<chr> <chr> <dbl>
1 Methionine PC1 -0.335
2 Lysine PC1 -0.320
3 Tyrosine PC1 -0.314
4 Leucine PC1 -0.298
5 Threonine PC1 -0.296
6 Isoleucine PC1 -0.292
7 Histidine PC1 -0.264
8 Valine PC1 -0.260
9 Phenylalanine PC1 -0.243
10 Glutamic Acid PC1 0.231
11 Aspartic Acid PC1 0.231
12 Tryptophan PC1 -0.216
13 Glycine PC1 -0.212
14 Cystine PC1 -0.140
15 Arginine PC1 -0.0661
16 Alanine PC1 0.0556
17 Proline PC1 0.0550
18 Serine PC1 -0.0137
# A tibble: 18 x 3
aminoacid PC loading
<chr> <chr> <dbl>
1 Cystine PC2 -0.541
2 Alanine PC2 0.330
3 Proline PC2 -0.316
4 Threonine PC2 0.267
5 Valine PC2 0.266
6 Isoleucine PC2 0.252
7 Leucine PC2 0.239
8 Methionine PC2 -0.237
9 Glycine PC2 -0.230
10 Aspartic Acid PC2 0.217
11 Histidine PC2 -0.191
12 Glutamic Acid PC2 0.107
13 Phenylalanine PC2 0.0974
14 Tyrosine PC2 0.0927
15 Serine PC2 0.0887
16 Lysine PC2 -0.0669
17 Tryptophan PC2 0.0305
18 Arginine PC2 -0.0292
corrs <- cor(pca$x, df.orig %>% as.matrix()) %>% as.data.frame()
corrs.df <- as.data.frame(corrs)
corrs.df$PC <- rownames(corrs.df)
corrs.df <- reshape2::melt(corrs.df)
Using PC as id variables
corrs.df.pc1
PC variable value
1 PC1 Methionine -0.8162151
2 PC1 Lysine -0.7805996
3 PC1 Tyrosine -0.7670752
4 PC1 Leucine -0.7261129
5 PC1 Threonine -0.7219619
6 PC1 Isoleucine -0.7132331
7 PC1 Histidine -0.6439290
8 PC1 Valine -0.6338435
9 PC1 Phenylalanine -0.5921456
10 PC1 Glutamic Acid 0.5644112
11 PC1 Aspartic Acid 0.5643827
12 PC1 Tryptophan -0.5279909
13 PC1 Glycine -0.5180004
14 PC1 Cystine -0.3414679
15 PC1 Arginine -0.1613451
16 PC1 Alanine 0.1357237
17 PC1 Proline 0.1342907
18 PC1 Serine -0.0334551
corrs.df.pc2
PC variable value
1 PC2 Cystine -0.78341155
2 PC2 Alanine 0.47760056
3 PC2 Proline -0.45752737
4 PC2 Threonine 0.38647373
5 PC2 Valine 0.38601124
6 PC2 Isoleucine 0.36442932
7 PC2 Leucine 0.34637683
8 PC2 Methionine -0.34354332
9 PC2 Glycine -0.33386962
10 PC2 Aspartic Acid 0.31399654
11 PC2 Histidine -0.27617274
12 PC2 Glutamic Acid 0.15513500
13 PC2 Phenylalanine 0.14103113
14 PC2 Tyrosine 0.13432469
15 PC2 Serine 0.12842080
16 PC2 Lysine -0.09686546
17 PC2 Tryptophan 0.04415382
18 PC2 Arginine -0.04224612
Methionine is the top factor for PC1, and cystine for PC2. This makes sense, as methionine is
roughly twice(?) as abundant in animal protein as in plant protein.
6d. Is it statistically significant? Based on the topmost factor that you identified in 6c
for PC1, perform a statistical test to test if this factor is statistically different between
the two groups you identified in c. To define the two groups you can make use of the
food_code column (HINT: Individual food codes are not so useful but categories
probably are) [15 points]
We reject the null since the p-value < 1e-16: the methionine content is not the same between
animal and plant proteins.
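The test itself is hidden in the rendered chunk; presumably it is a two-sample comparison of methionine between the animal and plant categories from 6b. A sketch with hypothetical stand-in values (the real test uses the df.master groups, not these numbers):

```r
# Hypothetical methionine values; NOT the assignment data.
methionine_animal <- c(2.1, 2.4, 1.9, 2.6, 2.2)
methionine_plant <- c(1.0, 1.2, 0.9, 1.1, 1.3)
# Welch two-sample t-test of the PC1 top factor between the two groups.
t.test(methionine_animal, methionine_plant)$p.value
```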