Permutation Hypothesis Test in R Programming
Last Updated :
24 Nov, 2020
In simple words, the permutation hypothesis test in R is a way of comparing a numerical value of 2 groups. The permutation Hypothesis test is an alternative to:
Let's implement this test in R programming.
Why use the Permutation Hypothesis Test?
- Small Sample Size.
- Assumptions(for parametric approach) not met.
- Test something other than classic approaches comparing Means and Medians.
- Difficult to estimate the SE for test-statistic.
Permutation Hypothesis Test Steps
- Specify a hypothesis
- Choose test-stat(Eg: Mean, Median, etc. )
- Determine Distribution of test-stat
- Convert test-stat to P-value
Note:
P-value = No. of permutations having a test-stat value greater than observed test-stat value/ No. of permutations.
Implementation in R
- Dataset: Chicken Diet Data. This dataset is a subset of the "chickwts" data in the "R dataset package". Download the data set here.
- Hypothesis: The weight of the chicken is independent of the type of diet.
Test-Statistics
- Test-Statistics #1: The absolute value of the difference in mean weights for the two diets | Y1 - Y2 |. This is the same test statistics as the independent two-sided two-sample t-test.
- Test-Statistics #2: The absolute value of the difference in median weights for the two diets | Median1 - Median2 |
R
# R program to illustrate
# Permutation Hypothesis Test
# load the data set
d <- read.table(file = "ChickData.csv",
header = T, sep = ",")
# print the dataset
print(d)
# check the names
names(d)
levels(d$feed)
# how many observations in each diet?
table(d$feed)
# let's look at a boxplot of weight gain by those 2 diets
boxplot(d$weight~d$feed, las = 1,
ylab = "weight (g)",
xlab = "feed",
main = "Weight by Feed")
# calculate the difference in sample MEANS
mean(d$weight[d$feed == "casein"]) # mean for casein
mean(d$weight[d$feed == "meatmeal"]) # mean for meatmeal
# lets calculate the absolute diff in means
test.stat1 <- abs(mean(d$weight[d$feed == "casein"]) -
mean(d$weight[d$feed == "meatmeal"]))
test.stat1
# calculate the difference in sample MEDIANS
median(d$weight[d$feed == "casein"]) # median for casein
median(d$weight[d$feed == "meatmeal"]) # median for meatmeal
# lets calculate the absolute diff in medians
test.stat2 <- abs(median(d$weight[d$feed == "casein"]) -
median(d$weight[d$feed == "meatmeal"]))
test.stat2
# Permutation Test
# for reproducability of results
set.seed(1979)
# the number of observations to sample
n <- length(d$feed)
# the number of permutation samples to take
P <- 100000
# the variable we will resample from
variable <- d$weight
# initialize a matrix to store the permutation data
PermSamples <- matrix(0, nrow = n, ncol = P)
# each column is a permutation sample of data
# now, get those permutation samples, using a loop
# let's take a moment to discuss what that code is doing
for(i in 1:P)
{
PermSamples[, i] <- sample(variable,
size = n,
replace = FALSE)
}
# we can take a quick look at the first 5 columns of PermSamples
PermSamples[, 1:5]
# initialize vectors to store all of the Test-stats
Perm.test.stat1 <- Perm.test.stat2 <- rep(0, P)
# loop thru, and calculate the test-stats
for (i in 1:P)
{
# calculate the perm-test-stat1 and save it
Perm.test.stat1[i] <- abs(mean(PermSamples[d$feed == "casein",i]) -
mean(PermSamples[d$feed == "meatmeal",i]))
# calculate the perm-test-stat2 and save it
Perm.test.stat2[i] <- abs(median(PermSamples[d$feed == "casein",i]) -
median(PermSamples[d$feed == "meatmeal",i]))
}
# before going too far with this,
# let's remind ourselves of
# the TEST STATS
test.stat1; test.stat2
# and, take a look at the first 15
# permutation-TEST STATS for 1 and 2
round(Perm.test.stat1[1:15], 1)
round(Perm.test.stat2[1:15], 1)
# and, let's calculate the permutation p-value
# notice how we can ask R a true/false question
(Perm.test.stat1 >= test.stat1)[1:15]
# and if we ask for the mean of all of those,
# it treats 0 = FALSE, 1 = TRUE
mean((Perm.test.stat1 >= test.stat1)[1:15])
# Calculate the p-value, for all P = 100,000
mean(Perm.test.stat1 >= test.stat1)
# and, let's calculate the p-value for
# option 2 of the test statistic (abs diff in medians)
mean(Perm.test.stat2 >= test.stat2)
Output:
> print(d)
weight feed
1 325 meatmeal
2 257 meatmeal
3 303 meatmeal
4 315 meatmeal
5 380 meatmeal
6 153 meatmeal
7 263 meatmeal
8 242 meatmeal
9 206 meatmeal
10 344 meatmeal
11 258 meatmeal
12 368 casein
13 390 casein
14 379 casein
15 260 casein
16 404 casein
17 318 casein
18 352 casein
19 359 casein
20 216 casein
21 222 casein
22 283 casein
23 332 casein
> names(d)
[1] "weight" "feed"
> levels(d$feed)
[1] "casein" "meatmeal"
> table(d$feed)
casein meatmeal
12 11

> mean(d$weight[d$feed == "casein"]) # mean for casein
[1] 323.5833
> mean(d$weight[d$feed == "meatmeal"]) # mean for meatmeal
[1] 276.9091
> test.stat1
[1] 46.67424
> median(d$weight[d$feed == "casein"]) # median for casein
[1] 342
> median(d$weight[d$feed == "meatmeal"]) # median for meatmeal
[1] 263
> test.stat2
[1] 79
> PermSamples[, 1:5]
[,1] [,2] [,3] [,4] [,5]
[1,] 379 283 380 352 206
[2,] 380 303 258 260 380
[3,] 257 206 379 380 153
[4,] 283 242 222 404 359
[5,] 222 260 325 258 258
[6,] 315 352 153 379 263
[7,] 352 263 263 325 325
[8,] 153 325 315 359 216
[9,] 368 379 344 242 260
[10,] 344 258 368 368 257
[11,] 359 257 206 257 315
[12,] 206 153 404 222 303
[13,] 404 344 303 390 390
[14,] 325 318 318 303 352
[15,] 242 404 332 263 404
[16,] 390 380 257 206 379
[17,] 260 332 216 315 318
[18,] 303 359 352 344 368
[19,] 263 222 242 283 222
[20,] 332 368 260 332 344
[21,] 318 315 283 318 283
[22,] 216 390 390 153 332
[23,] 258 216 359 216 242
> test.stat1; test.stat2
[1] 46.67424
[1] 79
> round(Perm.test.stat1[1:15], 1)
[1] 17.1 32.4 17.6 47.1 56.1 28.9 31.0 40.8 6.8 13.8 9.1 46.5 28.9 50.9 32.7
> round(Perm.test.stat2[1:15], 1)
[1] 61.0 75.0 4.5 59.0 78.0 17.0 62.0 38.5 4.5 16.0 23.0 60.5 63.5 75.0 37.0
> (Perm.test.stat1 >= test.stat1)[1:15]
[1] FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
> mean((Perm.test.stat1 >= test.stat1)[1:15])
[1] 0.2
> mean(Perm.test.stat1 >= test.stat1)
[1] 0.09959
> mean(Perm.test.stat2 >= test.stat2)
[1] 0.05407
Similar Reads
Hypothesis Testing in R Programming
A hypothesis is made by the researchers about the data collected for any experiment or data set. A hypothesis is an assumption made by the researchers that are not mandatory true. In simple words, a hypothesis is a decision taken by the researchers based on the data of the population collected. Hypo
6 min read
Pearson Correlation Testing in R Programming
Correlation is a statistical measure that indicates how strongly two variables are related. It involves the relationship between multiple variables as well. For instance, if one is interested to know whether there is a relationship between the heights of fathers and sons, a correlation coefficient c
5 min read
Bartlettâs Test in R Programming
In statistics, Bartlett's test is used to test if k samples are from populations with equal variances. Equal variances across populations are called homoscedasticity or homogeneity of variances. Some statistical tests, for example, the ANOVA test, assume that variances are equal across groups or sam
5 min read
T-Test Approach in R Programming
The T-Test is a statistical method used to determine whether there is a significant difference between the means of two groups or between a sample and a known value.For Example: businessman who owns two sweet shops in a town. He wants to know if there's a significant difference in the average number
5 min read
Fisherâs F-Test in R Programming
In this article, we will delve into the fundamental concepts of the F-Test, its applications, assumptions, and how to perform it using R programming. We will also provide a step-by-step guide with examples and visualizations to help you master the F-Test in the R Programming Language.What is Fisherâ
4 min read
Leveneâs Test in R Programming
Levene's test is an inferential statistic used to assess whether the variances of a variable are equal across two or more groups, especially when the data comes from a non-normal distribution. This test checks the assumption of homoscedasticity (equal variances) before conducting tests like ANOVA. I
3 min read
Fligner-Killeen Test in R Programming
The Fligner-Killeen test is a non-parametric test for homogeneity of group variances based on ranks. It is useful when the data are non-normally distributed or when problems related to outliers in the dataset cannot be resolved. It is also one of the many tests for homogeneity of variances which is
3 min read
Conditional Inference Trees in R Programming
Conditional Inference Trees is a non-parametric class of decision trees and is also known as unbiased recursive partitioning. It is a recursive partitioning approach for continuous and multivariate response variables in a conditional inference framework. To perform this approach in R Programming, ct
5 min read
Contingency Tables in R Programming
Prerequisite: Data Structures in R ProgrammingContingency tables are very useful to condense a large number of observations into smaller to make it easier to maintain tables. A contingency table shows the distribution of a variable in the rows and another in its columns. Contingency tables are not o
6 min read
Homogeneity of Variance Test in R Programming
In statistics, a sequence of random variables is homoscedastic if all its random variables have the same finite variance. This is also known as homogeneity of variance. In this article, let's explain methods for checking the homogeneity of variances test in R programming across two or more groups. S
7 min read