Shapiro–Wilk Test in R Programming
Last Updated: 16 Jul, 2020
The Shapiro-Wilk test (or Shapiro test) is a normality test in frequentist statistics. Its null hypothesis is that the population from which the sample was drawn is normally distributed, and the test is designed to detect all kinds of departures from normality. If the p-value is less than or equal to 0.05, the null hypothesis of normality is rejected, and we conclude with 95% confidence that the data depart significantly from a normal distribution. If the p-value is greater than 0.05, we conclude that there is no significant departure from normality. This test can be performed very easily in R.
Shapiro-Wilk Test Formula
Suppose a sample x_1, x_2, \ldots, x_n has come from a normally distributed population (the null hypothesis). The Shapiro-Wilk test statistic is defined as
W = \frac{\left(\sum_{i=1}^{n} a_i x_{(i)}\right)^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
where,
- x_{(i)} : the i-th smallest value in the sample (the i-th order statistic).
- \bar{x} = (x_1 + x_2 + \ldots + x_n)/n : the sample mean.
- a_i : coefficients given by (a_1, a_2, \ldots, a_n) = \frac{m^T V^{-1}}{C}, where m = (m_1, m_2, \ldots, m_n)^T is the vector of expected values of the order statistics of a standard normal sample, V is the covariance matrix of those order statistics, and C = \lVert V^{-1} m \rVert is the normalizing constant (a vector norm).
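To make the statistic concrete, the sketch below computes a rough approximation of W in R. It is an illustration only: the exact coefficients a_i depend on the covariance matrix V, which shapiro.test() handles internally, so here the weights are built only from the expected normal order statistics m (a Shapiro-Francia style approximation). The names x, m, a and W_approx are illustrative, and the value will be close to, but not identical to, the W reported by shapiro.test().
R
# Rough approximation of the W statistic for illustration
set.seed(42)
x <- rnorm(30)                    # toy sample drawn from a normal distribution
m <- qnorm(ppoints(length(x)))    # expected standard normal order statistics
a <- m / sqrt(sum(m^2))           # normalised weights (stand-in for the exact a_i)
W_approx <- sum(a * sort(x))^2 / sum((x - mean(x))^2)
W_approx
shapiro.test(x)$statistic         # exact W for comparison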
Implementation in R
To perform the Shapiro-Wilk test, R provides the built-in shapiro.test() function.
Syntax:
shapiro.test(x)
Parameter:
x : a numeric vector containing the data values. Missing values are allowed, but the number of non-missing values must be between 3 and 5000.
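As a quick sanity check before working with a real data set, the minimal sketch below (base R only) runs the test on simulated data: a normal sample should give a large p-value, while a clearly non-normal (exponential) sample should give a very small one.
R
# Quick sanity check on simulated data
set.seed(123)
shapiro.test(rnorm(100))   # normal sample: expect a large p-value
shapiro.test(rexp(100))    # exponential sample: expect a very small p-value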
Let us see how to perform the Shapiro Wilk's test step by step.
- Step 1: First install the required package. The only package needed here is dplyr, which is used for efficient data manipulation. It can be installed from the R console in the following way:
install.packages("dplyr")
- Step 2: Now load the installed package into the R script. This is done with the library() function in the following way.
R
# loading the package
library(dplyr)
- Step 3: The most important task is to select a proper data set. Here let's work with ToothGrowth, a built-in data set in R.
R
# loading the package
library("dplyr")
# Using the ToothGrowth data set
# loading the data set
my_data <- ToothGrowth
One can also use their own data set. For that, first prepare the data, save the file, and then import it into the script. The file can be read using one of the following calls:
data <- read.delim(file.choose())   # if the file is in .txt format
data <- read.csv(file.choose())     # if the file is in .csv format
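For instance, a sketch of that workflow with a hypothetical file "scores.csv" containing a numeric column named score (both names are placeholders for your own data) could look like this:
R
# Hypothetical file and column names; replace with your own
data <- read.csv("scores.csv")
shapiro.test(data$score)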
- Step 4: Set the random seed using the set.seed() function so the results are reproducible. Then display an output sample of 10 rows chosen randomly using the sample_n() function of the dplyr package. This is how we inspect our data.
R
# loading the package
library("dplyr")
# Using the ToothGrowth data set
# loading the data set
my_data <- ToothGrowth
# Using the set.seed() for
# random number generation
set.seed(1234)
# Using the sample_n() for
# random sample of 10 rows
dplyr::sample_n(my_data, 10)
Output:
len supp dose
1 11.2 VC 0.5
2 8.2 OJ 0.5
3 10.0 OJ 0.5
4 27.3 OJ 2.0
5 14.5 OJ 1.0
6 26.4 OJ 2.0
7 4.2 VC 0.5
8 15.2 VC 1.0
9 14.5 OJ 0.5
10 26.7 VC 2.0
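As a side note, sample_n() has been superseded in recent dplyr releases; if you are on dplyr 1.0.0 or later, an equivalent way to draw the same kind of random sample is slice_sample(), sketched below.
R
# Equivalent random sample of 10 rows with the newer dplyr verb
set.seed(1234)
dplyr::slice_sample(my_data, n = 10)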
- Step 5: Finally, perform the Shapiro-Wilk test using the shapiro.test() function.
R
# loading the package
library("dplyr")
# Using the ToothGrowth data set
# loading the data set
my_data <- ToothGrowth
# Using the set.seed()
# for random number generation
set.seed(1234)
# Using the sample_n()
# for random sample of 10 rows
dplyr::sample_n(my_data, 10)
# Using shapiro.test() to check
# for normality based
# on the len column
shapiro.test(my_data$len)
Output:
> dplyr::sample_n(my_data, 10)
len supp dose
1 11.2 VC 0.5
2 8.2 OJ 0.5
3 10.0 OJ 0.5
4 27.3 OJ 2.0
5 14.5 OJ 1.0
6 26.4 OJ 2.0
7 4.2 VC 0.5
8 15.2 VC 1.0
9 14.5 OJ 0.5
10 26.7 VC 2.0
> shapiro.test(my_data$len)
Shapiro-Wilk normality test
data: my_data$len
W = 0.96743, p-value = 0.1091
Since the p-value (0.1091) is greater than 0.05, we fail to reject the null hypothesis: the distribution of the given data does not differ significantly from a normal distribution, so we can assume normality.
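A common complement to the test is a visual check with a normal Q-Q plot. The short sketch below uses base R's qqnorm() and qqline(); points lying close to the reference line support the same conclusion of approximate normality.
R
# Visual check: normal Q-Q plot of the len column
qqnorm(my_data$len)
qqline(my_data$len)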