Advanced Statistics
College Department
Raffy S. Centeno
Outline
1. R Basics
Data Encoding and Cleaning
Data Frames
Math Operators
2. Descriptive Statistics
Measures of Central Tendency
Measures of Variability
3. Organizing Data
Summary Table
Contingency Table
Frequency Distribution
Percentage Distribution
4. Visualizing Data
Pie Chart
Bar Chart
Histogram
Scatter Plot
Box Plot
5. Hypothesis Testing
Introduction to Hypothesis Testing
The Rare Event Rule
Null and Alternative Hypotheses
Formulating Null and Alternative Hypotheses (One Sample)
One-tailed and Two-tailed Tests
R Function
object1 = scan()
Matrix Function
The matrix function allows the user to create a matrix of any size from
a vector object encoded in R.
R Function
object2 = matrix(object1, nrow=number.row,
ncol=number.column)
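As a minimal sketch of the call above, the following builds a 2-by-3 matrix from a vector typed directly into the script. The values and the names scores and score.matrix are made up for illustration. By default, R fills a matrix column by column.

```r
# Hypothetical data: six values stored in a vector
scores = c(1, 2, 3, 4, 5, 6)

# Arrange the vector into 2 rows and 3 columns (filled column by column)
score.matrix = matrix(scores, nrow = 2, ncol = 3)
print(score.matrix)
```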
Read CSV Function
Reads a file in table format and creates a data frame from it, with
cases corresponding to lines and variables to fields in the file. This
function is used when importing files from a spreadsheet (e.g., Excel).
Make sure that the file to be imported is a CSV file.
R Function
object1 = read.csv(file.choose(), header=TRUE)
Data Frames
Data Frame
The function creates data frames, tightly coupled collections of
variables which share many of the properties of matrices and of lists, used
as the fundamental data structure by most of R’s modeling software.
R Function
object1 = data.frame(object2, object3, ...)
Accessing objects in a Data Frame
Accessing an object in a Data Frame is easy. Just follow the syntax
given below.
R Function
object1$column.name
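The two calls above can be sketched together as follows. The student names, scores, and object names are hypothetical and serve only to show how a data frame is built and how one of its columns is accessed.

```r
# Hypothetical data: names and quiz scores of three students
student = c("Ana", "Ben", "Carla")
score = c(85, 90, 78)

# Combine the vectors into a data frame, one column per vector
class.data = data.frame(student, score)

# Access a single column with the $ operator
class.data$score
mean(class.data$score)
```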
Math Operators
Operation              Syntax   Description
Addition               +        Adds numeric values in a vector, matrix, or data frame
Subtraction            -        Subtracts numeric values in a vector, matrix, or data frame
Multiplication         *        Multiplies numeric values in a vector, matrix (element-wise), or data frame
Division               /        Divides numeric values in a vector, matrix, or data frame
Exponent               ^        Raises numeric values in a vector, matrix, or data frame to a power
Matrix Multiplication  %*%      Performs matrix multiplication
Modular Arithmetic     %%       Computes the remainder of a division
Integer Division       %/%      Divides numbers and rounds the result down
Parenthesis            ()       Changes the order of the operations
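A short sketch of the operators in the table, using made-up vectors and matrices (x, y, A, and B are illustrative names only). Note the difference between * (element-wise) and %*% (matrix product).

```r
# Hypothetical vectors
x = c(7, 8, 9)
y = c(3, 2, 1)

x + y      # element-wise addition
x - y      # element-wise subtraction
x * y      # element-wise multiplication
x / y      # element-wise division
x ^ 2      # each element squared
x %% y     # remainder of each division
x %/% y    # integer part of each division

# Hypothetical 2-by-2 matrices (filled column by column)
A = matrix(1:4, nrow = 2)
B = matrix(5:8, nrow = 2)

A * B      # element-wise product
A %*% B    # matrix product

(2 + 3) * 4   # parentheses are evaluated first
```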
Chapter II
Descriptive Statistics
Definition
Descriptive statistics are brief descriptive coefficients that
summarize a given data set, which can be either a representation of
the entire population or a sample of it. They do not actually test any
hypotheses.
Measures of Central Tendency
Definition
A measure of central tendency is a single value that attempts to
describe a set of data by identifying the central position within that
set of data. As such, measures of central tendency are sometimes
called measures of central location. They are also classed as summary
statistics.
Definition
The arithmetic mean is a mathematical representation of the
typical value of a series of numbers, computed as the sum of all the
numbers in the series divided by the count of all numbers in the
series.
R Function
mean(object1)
When to use the Arithmetic Mean
- If the measurement scale of your data is at least interval.**
- If your data is normally distributed.
- If your data has no significant outlier(s).
Remarks **
This may differ depending on the school of thought you belong to.
Definition
The median is the middle score for a set of data that has been
arranged in order of magnitude.
R Function
median(object1)
When to use the Median
- If the measurement scale of your data is at least ordinal.
Measures of Variability
Definition
A measure of variability, sometimes also called a measure of
dispersion or measure of spread, is used to describe the variability in a
sample or population. It is usually used in conjunction with a
measure of central tendency, such as the mean or median, to provide an
overall description of a set of data.
Importance
A measure of variability gives us an idea of how well the mean, for
example, represents the data. If the spread of values in the data set
is large, the mean is not as representative of the data as if the spread
of data is small. This is because a large spread indicates that there
are probably large differences between individual scores. Additionally,
in research, it is often seen as positive if there is little variation in
each data group, as it indicates that the values in the group are similar.
Definition
The standard deviation is a measure of the spread of scores within
a set of data.
R Function
sd(object1) and var(object1)
When to use Standard Deviation and Variance
Standard deviation and variance are used when the arithmetic
mean is used to calculate central tendency.
Definition
The range is the difference between the highest and lowest scores in
a data set and is the simplest measure of spread.
R Function
max(object1) - min(object1)
When to use Range
If you are measuring a variable that has either a critical low or high
threshold (or both) that should not be crossed. The range will
instantly inform you whether at least one value broke these critical
thresholds.
Example 2.1
During the first marking period, Nicole’s math quiz scores were as
follows:
Solve for the mean, median, sample variance, sample standard
deviation, and the range of the data.
Example 2.2
Thirty AAA batteries were tested to determine how long they would
last. The results, to the nearest minute, were recorded as follows:
423, 369, 387, 411, 393, 394, 371, 377, 389, 409
392, 408, 431, 401, 363, 391, 405, 382, 400, 381
399, 415, 428, 422, 396, 372, 410, 419, 386, 390
Solve for the mean, median, sample variance, sample standard
deviation, and the range of the data.
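One way to work Example 2.2 in R is sketched below, using the functions introduced in this chapter. The object name battery is an illustrative choice; the values are the thirty recorded times.

```r
# Battery life in minutes, from Example 2.2
battery = c(423, 369, 387, 411, 393, 394, 371, 377, 389, 409,
            392, 408, 431, 401, 363, 391, 405, 382, 400, 381,
            399, 415, 428, 422, 396, 372, 410, 419, 386, 390)

mean(battery)                  # arithmetic mean
median(battery)                # median
var(battery)                   # sample variance
sd(battery)                    # sample standard deviation
max(battery) - min(battery)    # range
```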
Example 2.3
A survey was taken on Maple Avenue. In each of 20 homes, people
were asked how many cars were registered to their households. The
results were recorded as follows:
1, 2, 1, 0, 3, 4, 0, 1, 1, 1
2, 2, 3, 2, 3, 2, 1, 4, 0, 0
Solve for the mean, median, sample variance, sample standard
deviation, and the range of the data.
Chapter II
Organizing Data
Summary Table
The Summary Table
The summary table presents tallied responses as frequencies or
percentages for each category.
R Function
table(object1) or prop.table(table(object1))
Contingency Table
The Contingency Table
The contingency table is a table showing the distribution of one
variable in rows and another in columns, used to study the association
between the two variables.
R Function
xtabs(~ var1 + var2, data = object1)
Frequency Distribution
Frequency Distribution
The frequency distribution summarizes numerical values by tallying
them into a set of numerically ordered classes.
R Function
breaks = seq(data.low, data.high, by=class.width)
object2 = cut(object1, breaks, right=F)
table(object2)
Percentage Distribution
Percentage Distribution
The percentage distribution summarizes numerical values by
listing the percentage of each group.
When you are comparing two or more groups, knowing the percentage
of the total that is in each group is more useful than knowing the
frequency count of each group.
R Function
breaks = seq(data.low, data.high, by=class.width)
object2 = cut(object1, breaks, right=F)
prop.table(table(object2))
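As a sketch of both distributions, the battery data from Example 2.2 can be grouped into classes. The class limits (360 to 440) and the class width of 10 are assumptions chosen only so that every value falls into a class; other choices are equally valid.

```r
# Battery life in minutes, from Example 2.2
battery = c(423, 369, 387, 411, 393, 394, 371, 377, 389, 409,
            392, 408, 431, 401, 363, 391, 405, 382, 400, 381,
            399, 415, 428, 422, 396, 372, 410, 419, 386, 390)

# Assumed class limits: 360 up to 440 in steps of 10
breaks = seq(360, 440, by = 10)

# right = FALSE makes each class include its lower limit, e.g. [360,370)
classes = cut(battery, breaks, right = FALSE)

freq = table(classes)      # frequency distribution
pct = prop.table(freq)     # percentage distribution (as proportions)

freq
round(100 * pct, 1)        # proportions expressed as percentages
```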
Chapter III
Visualizing Data
Pie Chart
Pie Chart
A pie chart uses parts of a circle to represent the tallies of each
category. The size of each part, or pie slice, varies according to the
percentage in each category.
R Function
pie(table(object1), main="Title of the Chart")
Bar Chart
Bar Chart
A bar chart compares different categories by using individual bars to
represent the tallies for each category. The length of a bar represents
the amount, frequency, or percentage of values falling into a category.
R Function
barplot(table(object1), main="Title of the Chart",
xlab="x-axis label", ylab="y-axis label")
Histogram
The Histogram
A histogram is a bar chart for grouped numerical data in which you
use vertical bars to represent the frequencies or percentages in each
group. In a histogram, there are no gaps between adjacent bars.
R Function
hist(object1, breaks=number.of.class,
main="Title of the Chart",
xlab="x-axis label", ylab="y-axis label")
Scatter Plot
Scatter Plot
A scatter plot is a useful summary of a set of bivariate data (two
variables), usually drawn before working out a linear correlation or
fitting a regression line.
R Function
plot(object.on.x.axis, object.on.y.axis,
main="Title of the Chart",
xlab="x-axis label", ylab="y-axis label")
Box Plot
Box Plot
The box plot (a.k.a. box and whisker diagram) is a standardized
way of displaying the distribution of data based on the five number
summary: minimum, first quartile, median, third quartile, and
maximum. Box plots are used to check for outliers and to support
non-parametric tests in identifying differences between groups.
R Function
boxplot(object1)
or
boxplot(data ~ group, data = object1,
main="Title of the Chart",
xlab="x-axis label", ylab="y-axis label")
Chapter IV
Hypothesis Testing
Introduction to Hypothesis Testing
Definition
In statistics, a hypothesis is an assumption about certain
characteristics of a population.
Definition
A hypothesis test is a statistical test that is used to determine
whether there is enough evidence in a sample of data to infer that a
certain condition is true for the entire population.
Remarks
The null hypothesis always includes the equal sign. Usually, it is a
statement of "no effect" or "no difference".
Definition
An alternative hypothesis, denoted by Ha or H1, is a hypothesis
used in hypothesis testing that is contrary to the null hypothesis.
Therefore, an alternative hypothesis is the negation of your null
hypothesis.
Remarks
The alternative hypothesis does not include the equal sign and it is
the statement you want to be able to conclude is true.
Note
If we reject the null hypothesis, we automatically accept the
alternative hypothesis because we were able to find evidence against the
null hypothesis. However, we do not accept the null hypothesis. We just
failed to reject it.
Answer
H0 : p = 0.80
Ha : p ≠ 0.80
Example 2
Claim: According to research, the average life expectancy of Filipinos
is 68 years old.
Answer
H0 : µ = 68 years
Ha : µ ≠ 68 years
Example 3
Claim: Your friend claims that Coke "litro" is less than 1 liter.
Answer
H0 : µ = 1 liter
Ha : µ < 1 liter
Example 4
Claim: A teacher is claiming that the average IQ of his students is
greater than 100.
Answer
H0 : µ = 100
Ha : µ > 100
Example 5
Claim: The average weekly earning for men is higher than $670, the
women’s average.
Answer
H0 : µ = $ 670
Ha : µ > $ 670
Example 6
Claim: The daily yield for a local chemical plant has averaged 880
tons for the last several years. The quality control manager would
like to know whether this average has changed in recent months.
Answer
H0 : µ = 880 tons
Ha : µ ≠ 880 tons
One-tailed and Two-tailed Tests
Definition
A two-tailed test states that the null hypothesis is wrong. A
two-tailed alternative hypothesis does not predict whether the parameter
of interest is larger or smaller than the reference value specified in
the null hypothesis.
H0 : µ = 68 years
Ha : µ ≠ 68 years
Example of a Two-tailed Test
You want to check if the average height of male Filipinos is 162
centimeters.
H0 : µ = 162 cm
Ha : µ ≠ 162 cm
Example of a Two-tailed Test
A teacher wants to check if the IQ of his students is equal to 100.
H0 : µ = 100
Ha : µ ≠ 100
Definition
A one-tailed test states that the null hypothesis is wrong, and also
specifies whether the true value of the parameter is greater than or
less than the reference value specified in the null hypothesis.
The alternative hypothesis uses a strict inequality, either greater
than (>) or less than (<).
Remarks
The advantage of using a one-tailed test is increased power to
detect the specific effect you are interested in. The disadvantage is
that there is no power to detect an effect in the opposite direction.
Example of a One-tailed Test
You want to test if the average life expectancy of Filipinos is greater
than 68 years old. Your hypotheses will be the following:
H0 : µ = 68 years
Ha : µ > 68 years
Example of a One-tailed Test
You want to check if the average height of male Filipinos is less than
162 centimeters.
H0 : µ = 162 cm
Ha : µ < 162 cm
Example of a One-tailed Test
A teacher wants to check if the IQ of his students is greater than 100.
H0 : µ = 100
Ha : µ > 100
Chapter V
Formulating a Conclusion in
Hypothesis Testing
Types of Error in Hypothesis Testing
Type I Error
When the null hypothesis is true and you reject it, you make
a type I error. The probability of making a type I error is α, which is
the level of significance you set for your hypothesis test.
To lower this risk, you must use a lower value for α. However, using
a lower value for alpha means that you will be less likely to detect a
true difference if one really exists.
Remarks
You must choose your significance level before you begin your study.
Doing so protects you from choosing a significance level just because
it conveniently gives you significant results.
The common alpha values of 0.05 and 0.01 are simply based on
tradition. For a significance level of 0.05, expect to obtain sample
means in the critical region 5% of the time when the null hypothesis
is true.
Type II Error
When the null hypothesis is false and you fail to reject it, you
make a type II error. The probability of making a type II error is β,
and the power of the test is 1 − β.
Remarks
Always report the p-value so your readers can draw their own
conclusions.
Example Scenario 1
You want to check if the average height of male Filipinos is 162 cm.
H0 : µ = 162 cm
Ha : µ ≠ 162 cm
Then, you conducted a survey with correct sampling procedures and
research design. You found out that the average height of your
respondents is 163.5 cm with sample standard deviation of 4 cm.
After running the analysis, you obtained a p-value of 0.3547. What
will be your conclusion if your α = 0.05?
Example Scenario 2
Your friend claims that Coke "litro" is less than 1 liter.
H0 : µ = 1 liter
Ha : µ < 1 liter
Then, you measure the volume of the Coke litro every time you buy
one. After a couple of days, you were able to measure 50 Coke litros
and found that the average volume of your sample is 0.99 liter
with a sample standard deviation of 0.09 liter. Suppose you were able
to measure all your samples accurately. Will you support your friend’s
claim if your p-value is 0.6742 at α = 0.05?
Example 3
At α = 0.05, what is your conclusion about the given result:
p-value = 0.9874
Conclusion
We failed to reject our null hypothesis.
Example 4
At α = 0.05, what is your conclusion about the given result:
p-value = 0.3746
Conclusion
We failed to reject our null hypothesis.
Example 5
At α = 0.05, what is your conclusion about the given result:
p-value = 0.0393
Conclusion
We reject our null hypothesis and accept our alternative hypothesis.
Example 6
At α = 0.01, what is your conclusion about the given result:
p-value = 0.0393
Conclusion
We failed to reject our null hypothesis.
Example 7
At α = 0.01, what is your conclusion about the given result:
p-value = 0.0523
Conclusion
We failed to reject our null hypothesis.
Example 8
At α = 0.01, what is your conclusion about the given result:
p-value = 0.0523
Conclusion
We failed to reject our null hypothesis.
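The conclusions in the examples above follow one decision rule: reject the null hypothesis when the p-value does not exceed α, otherwise fail to reject it. A minimal sketch of that rule is given below. The function name decide is hypothetical, and the convention of rejecting when the p-value is exactly equal to α is an assumption, since the examples never hit that boundary.

```r
# Hypothetical helper: compares a p-value with the significance level
decide = function(p.value, alpha = 0.05) {
  if (p.value <= alpha) {
    "Reject H0"
  } else {
    "Fail to reject H0"
  }
}

decide(0.9874, alpha = 0.05)   # Example 3
decide(0.0393, alpha = 0.05)   # Example 5
decide(0.0393, alpha = 0.01)   # Example 6
```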
Chapter VI
Statistical Test for the
Assumption of Normality
Definition
The Shapiro-Wilk test for normality is one of three general
normality tests designed to detect all departures from normality. Its
hypotheses are as follows:
Teacher 1 2 3 4 5 6 7 8 9 10
Salary 15 18 16 14 15 15 12 17 30 35
Method 1 Method 2
0.011 0.013 0.013 0.011 0.016 0.013
0.015 0.014 0.013 0.012 0.015 0.012
0.010 0.013 0.011 0.017 0.013 0.014
0.012 0.015
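A minimal sketch of the Shapiro-Wilk test in R, applied to the teacher salary data tabulated above. It uses shapiro.test from base R; the object names are illustrative. The test's null hypothesis is that the data come from a normally distributed population, so a p-value at or below α is evidence against normality.

```r
# Teacher salaries (in thousands), from the table above
salary = c(15, 18, 16, 14, 15, 15, 12, 17, 30, 35)

# H0: the data come from a normally distributed population
normality = shapiro.test(salary)
print(normality)

# The p-value can be extracted and compared with alpha
normality$p.value
```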
H0 : µ = m0
Alternative Hypothesis - Two Tailed
Ha : There is a significant difference between the true mean µ and
the comparison value m0
Ha : µ ≠ m0
Alternative Hypothesis - One Tailed
Ha : The true mean µ is greater than the comparison value m0
Ha : µ > m0
Alternative Hypothesis - One Tailed
Ha : The true mean µ is less than the comparison value m0
Ha : µ < m0
Scenario 1
Suppose you are interested in determining whether an assembly line
produces laptop computers that weigh five pounds.
Scenario 2
You want to check if the average height of male Filipinos is 162 cm.
One Sample T-test
Use
It is a parametric statistical procedure used to examine the mean
difference between the sample and the known value of the population
mean.
R Function
t.test(object1, mu=comparison.value, alternative="two.sided")  # or "less", "greater"
Assumptions
- Your variable should be measured at the interval or ratio level.
- The data are independent, which means that there is no
  relationship between the observations.
- There should be no significant outliers.
- Your dependent variable should be approximately normally
  distributed.
One Sample Wilcoxon Test
Use
It is a non-parametric alternative to a one-sample t-test. The test
determines whether the median of the sample is equal to some
specified value. Use this test if your data violates at least one assumption
of the parametric test.
R Function
wilcox.test(object1, mu=comparison.value, alternative="two.sided")  # or "less", "greater"
Assumptions
- Your variable should be measured at the ordinal level or higher.
- The data are independent, which means that there is no
  relationship between the observations.
START: Is your data normally distributed?
  Yes: One Sample T-test
  No: One Sample Wilcoxon Test
Consider the wages, in thousands, of the teachers at Holy Cross
College of Marilog below:
Teacher 1 2 3 4 5 6 7 8 9 10
Salary 15 18 16 14 15 15 12 17 30 35
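The choice between the two one-sample tests can be sketched with the salary data. The comparison value of 15 (thousand) is purely hypothetical, because no claim accompanies the data here; the two-sided alternative and α = 0.05 are likewise assumptions. exact = FALSE is passed to wilcox.test because the data contain ties.

```r
# Teacher salaries (in thousands), from the table above
salary = c(15, 18, 16, 14, 15, 15, 12, 17, 30, 35)

comparison.value = 15   # hypothetical value, not given in the source
alpha = 0.05            # assumed significance level

# Step 1: check the normality assumption
normality = shapiro.test(salary)

# Step 2: pick the test according to the normality check
if (normality$p.value > alpha) {
  result = t.test(salary, mu = comparison.value,
                  alternative = "two.sided")
} else {
  result = wilcox.test(salary, mu = comparison.value,
                       alternative = "two.sided", exact = FALSE)
}
print(result)
```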
H0 : µd = 0
Alternative Hypothesis - Two Tailed
Ha : The true mean difference (µd) is not equal to zero.
Ha : µd ≠ 0
Alternative Hypothesis - One Tailed
Ha : The true mean difference (µd) is greater than zero.
Ha : µd > 0
Alternative Hypothesis - One Tailed
Ha : The true mean difference (µd) is less than zero.
Ha : µd < 0
Scenario 1
You are a teacher and you want to check if your students learned
something in your discussion. You conducted a Pre-test and
Post-test to measure their learning before and after the discussion.
Scenario 2
You are a coach and you want to check if your players improved after
taking some enhancement drugs.
Paired T-test
Use
The paired sample t-test, sometimes called the dependent sample
t-test, is a parametric statistical procedure used to determine whether
the mean difference between two sets of observations is zero. In
a paired sample t-test, each subject or entity is measured twice,
resulting in pairs of observations.
R Function
t.test(object1, object2, paired=TRUE, alternative="two.sided")  # or "less", "greater"
Assumptions
- Your variable should be measured at the interval or ratio level.
- Your independent variable should consist of two categorical,
  "related groups" or "matched pairs."
- There should be no significant outliers.
- The differences between two sets of values should be
  approximately normally distributed.
Wilcoxon Signed-Rank Test
Use
The Wilcoxon Signed-Rank test is a non-parametric statistical
hypothesis test used when comparing two related samples, matched
samples, or repeated measurements on a single sample to assess
whether their population mean ranks differ.
R Function
wilcox.test(object1, object2, paired=TRUE, alternative="two.sided")  # or "less", "greater"
Assumptions
- Your variable should be measured at the ordinal level or higher.
- The data are independent, which means that there is no
  relationship between the observations.
Does it appear that the average recall score is higher when imagery
is used?
Pupil  With Imagery  Without Imagery    Pupil  With Imagery  Without Imagery
1      20            5                  11     17            8
2      24            9                  12     20            16
3      20            5                  13     20            10
4      18            9                  14     16            12
5      22            6                  15     24            7
6      19            11                 16     22            9
7      20            8                  17     25            21
8      19            11                 18     21            14
9      17            7                  19     19            12
10     21            9                  20     23            13
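A sketch of how the imagery question could be analyzed, assuming each pupil was measured under both conditions (so the observations are paired) and assuming α = 0.05. The one-tailed alternative "greater" matches the question of whether recall is higher with imagery. Object names are illustrative.

```r
# Recall scores for pupils 1 to 20, from the table above
with.imagery = c(20, 24, 20, 18, 22, 19, 20, 19, 17, 21,
                 17, 20, 20, 16, 24, 22, 25, 21, 19, 23)
without.imagery = c(5, 9, 5, 9, 6, 11, 8, 11, 7, 9,
                    8, 16, 10, 12, 7, 9, 21, 14, 12, 13)

alpha = 0.05   # assumed significance level

# The paired t-test assumes the differences are approximately normal
differences = with.imagery - without.imagery
normality = shapiro.test(differences)

if (normality$p.value > alpha) {
  result = t.test(with.imagery, without.imagery,
                  paired = TRUE, alternative = "greater")
} else {
  # exact = FALSE because the differences contain ties
  result = wilcox.test(with.imagery, without.imagery,
                       paired = TRUE, alternative = "greater",
                       exact = FALSE)
}
print(result)
```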
A taxi company manager is trying to decide whether the use of radial
tires instead of regular belted tires improves fuel economy. Twelve
cars were equipped with radial tires and driven over a prescribed test
course. Without changing drivers, the same cars were then equipped
with regular belted tires and driven once again over the test course.
The gasoline consumption, in kilometers per liter, was recorded.
Can we conclude that cars equipped with radial tires give better fuel
economy than those equipped with belted tires at α = 0.05?
Car Radial Tires Belted Tires Car Radial Tires Belted Tires
1 4.2 4.1 7 5.7 5.7
2 4.7 4.9 8 6.0 5.8
3 6.6 6.2 9 7.4 6.9
4 7.0 6.9 10 4.9 4.7
5 6.7 6.8 11 6.1 6.0
6 4.5 4.4 12 5.2 4.9
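The tire example can be sketched as a paired t-test, since the same twelve cars were driven with both kinds of tires. The sketch assumes the differences are approximately normally distributed; if that assumption fails, the Wilcoxon Signed-Rank test would be used instead. Object names are illustrative.

```r
# Fuel economy in kilometers per liter for cars 1 to 12
radial = c(4.2, 4.7, 6.6, 7.0, 6.7, 4.5, 5.7, 6.0, 7.4, 4.9, 6.1, 5.2)
belted = c(4.1, 4.9, 6.2, 6.9, 6.8, 4.4, 5.7, 5.8, 6.9, 4.7, 6.0, 4.9)

# H0: mean difference (radial - belted) is zero
# Ha: mean difference is greater than zero (radial tires do better)
result = t.test(radial, belted, paired = TRUE, alternative = "greater")
print(result)

# Decision at alpha = 0.05
result$p.value <= 0.05
```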
Chapter IX
Statistical Test for the
Assumption of Homoscedasticity
for Two Sample Data
Statistical Test for the Assumption of
Homoscedasticity
Definition
An F-test is used to test if the variances of two populations are
equal. Its hypotheses are as follows:
R Function
var.test(object1, object2)
Some Alternative Tests
- Bartlett’s Test
- Levene’s Test
- Chi-Square Test
- Residual Plot
Chapter X
Tests Concerning Means/Medians
for Two Samples
Use
The Independent Samples Test compares the means/medians of
two independent groups in order to determine whether there is
statistical evidence that the associated population means/medians are
significantly different.
Hypotheses
Null Hypothesis
H0 : The population means/medians from the two unrelated groups
are equal.
H0 : µ1 = µ2
Alternative Hypothesis - Two Tailed
Ha : The population means/medians from the two unrelated groups
are not equal.
Ha : µ1 ≠ µ2
Alternative Hypothesis - One Tailed
Ha : The population mean/median from group one is greater than
population mean/median from group two.
Ha : µ1 > µ2
Alternative Hypothesis - One Tailed
Ha : The population mean/median from group one is less than
population mean/median from group two.
Ha : µ1 < µ2
Scenario 1
You want to check whether there is a significant (or only random)
difference in the average cycle time to deliver a pizza from Pizza
Company A vs. Pizza Company B.
Scenario 2
Do two types of music, type-I and type-II, have different effects upon
the ability of college students to perform a series of mental tasks
requiring concentration?
Two Sample T-test
Use
The two-sample t-test is a hypothesis test for answering questions
about the mean where the data are collected from two random
samples of independent observations.
R Function
t.test(object1, object2, var.equal=TRUE, alternative="two.sided")  # or "less", "greater"
Assumptions
- Your variable should be measured at the interval or ratio level.
- The data are independent, which means that there is no
  relationship between the observations.
- There should be no significant outliers.
- Your dependent variable should be approximately normally
  distributed.
- The variance of each group should be equal.
Two Sample T-test with Welch
Correction
Use
Welch’s t-test is an adaptation of Student’s t-test, that is, it has
been derived with the help of Student’s t-test and is more reliable
when the two samples have unequal variances.
R Function
t.test(object1, object2, var.equal=FALSE, alternative="two.sided")  # or "less", "greater"
Assumptions
- Your variable should be measured at the interval or ratio level.
- The data are independent, which means that there is no
  relationship between the observations.
- There should be no significant outliers.
- Your dependent variable should be approximately normally
  distributed.
Mann-Whitney Test
Use
The Mann-Whitney U test is used to compare differences between
two independent groups when the dependent variable is either ordinal
or continuous, but not normally distributed.
R Function
wilcox.test(object1, object2, alternative="two.sided")  # or "less", "greater"
Assumptions
- Your variable should be measured at the ordinal level or higher.
- The data are independent, which means that there is no
  relationship between the observations.
START: Is your data normally distributed?
  No: Mann-Whitney Test
  Yes: Are your variances equal?
    Yes: Two Sample T-test
    No: Two Sample T-test with Welch Correction
We want to compare the heights in inches of two groups of
individuals. The data are given below.
Group A 175 168 168 190 156 181 182 175 174 179
Group B 120 180 125 188 130 190 110 185 112 188
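The decision path for two independent samples can be sketched with the height data. The sketch assumes α = 0.05 for every check and a two-sided alternative, neither of which is stated with the data; object names are illustrative. Normality is checked in each group, then equality of variances with the F-test, and the matching test is run.

```r
# Heights of the two groups, from the table above
group.a = c(175, 168, 168, 190, 156, 181, 182, 175, 174, 179)
group.b = c(120, 180, 125, 188, 130, 190, 110, 185, 112, 188)

alpha = 0.05   # assumed significance level

# Step 1: normality of each group
normal = shapiro.test(group.a)$p.value > alpha &&
         shapiro.test(group.b)$p.value > alpha

if (!normal) {
  # Not normal: Mann-Whitney test (exact = FALSE because of ties)
  result = wilcox.test(group.a, group.b,
                       alternative = "two.sided", exact = FALSE)
} else if (var.test(group.a, group.b)$p.value > alpha) {
  # Normal with equal variances: two-sample t-test
  result = t.test(group.a, group.b, var.equal = TRUE,
                  alternative = "two.sided")
} else {
  # Normal with unequal variances: Welch correction
  result = t.test(group.a, group.b, var.equal = FALSE,
                  alternative = "two.sided")
}
print(result)
```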
Woman 38.9 71.2 73.3 21.8 63.4 64.6 38.4 28.8 28.5
Man 67.8 60.0 63.4 76.0 89.4 73.3 67.3 61.3 62.4
Test at α = 0.05 if the average weight of the high protein diet rats
is significantly greater than the other group.
Chapter XI
Statistical Test for the
Assumption of Homoscedasticity
for More than Two Sample Data
Definition
Bartlett’s test is used to test if k samples are from populations
with equal variances. Equal variances across populations is called
homoscedasticity or homogeneity of variances.
R Function
bartlett.test(data ~ group, data = object)
Some Alternative Tests
- Levene’s Test
- Standard Deviation Plot
- Boxplots
Chapter XII
Tests Concerning Means/Medians
for More than Two Samples
Use
Tests concerning means/medians for more than two samples
compare the means/medians of more than two independent groups in order to
determine whether there is statistical evidence that the associated
population means/medians are significantly different.
Hypotheses
Null Hypothesis
H0 : The population mean/median is equal for all groups.
H0 : µ1 = µ2 = ... = µn
Alternative Hypothesis - Two Tailed
Ha : At least one population mean/median is significantly different
from the others.
Scenario 1
Suppose we want to test the effect of five different exercises. For
this, we recruit 20 men and assign one type of exercise to 4 men (5
groups). Their weights are recorded after a few weeks.
Scenario 2
You want to study the effect of fertilizers on yield of wheat. We
apply five fertilizers, each of different quality, on five plots of land
each of wheat. The yield from each plot of land is recorded and the
difference in yield among the plots is observed.
The Analysis of Variance
Use
The one-way analysis of variance (ANOVA) is used to determine
whether there are any statistically significant differences between the
means of three or more independent (unrelated) groups.
R Function
oneway.test(data ~ group, data = object1, var.equal = TRUE)
Assumptions
- Your variable should be measured at the interval or ratio level.
- The data are independent, which means that there is no
  relationship between the observations.
- There should be no significant outliers.
- Your dependent variable should be approximately normally
  distributed.
- The variance of each group should be equal.
ANOVA with Welch Correction
Use
Welch’s ANOVA compares three or more means to see if they are
equal. It is an alternative to the Classic ANOVA and can be used even
if your data violates the assumption of homogeneity of variances.
R Function
oneway.test(data ~ group, data = object1, var.equal = FALSE)
Assumptions
- Your variable should be measured at the interval or ratio level.
- The data are independent, which means that there is no
  relationship between the observations.
- There should be no significant outliers.
- Your dependent variable should be approximately normally
  distributed.
Kruskal-Wallis Test
Use
The Kruskal-Wallis H test (sometimes also called the "one-way
ANOVA on ranks") is a rank-based nonparametric test that can be
used to determine if there are statistically significant differences
between two or more groups of an independent variable on a continuous
or ordinal dependent variable.
R Function
kruskal.test(data ~ group, data = object1)
Assumptions
- Your variable should be measured at the ordinal level or higher.
- The data are independent, which means that there is no
  relationship between the observations.
START: Is your data normally distributed?
  No: Kruskal-Wallis Test
  Yes: Are your variances equal?
    Yes: Analysis of Variance
    No: ANOVA with Welch Correction
Spray A 10 7 20 14 14 12 10 23 17 20 14 13
Spray B 11 17 21 11 16 14 17 17 19 21 7 13
Spray C 3 5 3 5 3 6 1 1 3 2 6 4
Spray D 11 9 15 22 15 16 13 10 26 26 24 13
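A sketch of how the spray data could be analyzed with the functions from this chapter. The question attached to the data is not shown, so the sketch assumes the goal is to check whether the average count differs across the four sprays; the column names count and spray are illustrative. All four calls are shown, although in practice only the one selected by the assumption checks would be reported.

```r
# Counts for each spray, from the table above (12 values per spray)
count = c(10, 7, 20, 14, 14, 12, 10, 23, 17, 20, 14, 13,    # Spray A
          11, 17, 21, 11, 16, 14, 17, 17, 19, 21, 7, 13,    # Spray B
          3, 5, 3, 5, 3, 6, 1, 1, 3, 2, 6, 4,               # Spray C
          11, 9, 15, 22, 15, 16, 13, 10, 26, 26, 24, 13)    # Spray D
spray = factor(rep(c("A", "B", "C", "D"), each = 12))
sprays = data.frame(count, spray)

# Homogeneity of variances across the groups
bartlett.test(count ~ spray, data = sprays)

# Classic ANOVA (equal variances assumed)
anova.equal = oneway.test(count ~ spray, data = sprays, var.equal = TRUE)

# ANOVA with Welch correction (equal variances not assumed)
anova.welch = oneway.test(count ~ spray, data = sprays, var.equal = FALSE)

# Kruskal-Wallis test (normality not assumed)
kw = kruskal.test(count ~ spray, data = sprays)

print(anova.equal)
print(anova.welch)
print(kw)
```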
No Exercise 23 26 51 49 58 37 29 44
20 minutes 22 27 39 29 46 48 49 65
60 minutes 59 66 38 49 56 60 56 62
H0 : ρ = 0
Alternative Hypothesis - Two Tailed
Ha : There is a significant relationship between two variables.
Ha : ρ ≠ 0
Alternative Hypothesis - One Tailed
Ha : There is a significant positive relationship between two
variables.
Ha : ρ > 0
Alternative Hypothesis - One Tailed
Ha : There is a significant negative relationship between two
variables.
Ha : ρ < 0
Interpretation of Correlation Coefficient
Correlation Coefficient   Degree of Correlation
ρ = 1.00                  Perfect Positive Correlation
0.80 ≤ ρ < 1.00           Very Strong Positive Correlation
0.60 ≤ ρ < 0.80           Strong Positive Correlation
0.40 ≤ ρ < 0.60           Moderate Positive Correlation
0.20 ≤ ρ < 0.40           Weak Positive Correlation
0 < ρ < 0.20              Very Weak Positive Correlation
ρ = 0                     No Correlation
Correlation Coefficient   Degree of Correlation
ρ = −1.00                 Perfect Negative Correlation
−1.00 < ρ ≤ −0.80         Very Strong Negative Correlation
−0.80 < ρ ≤ −0.60         Strong Negative Correlation
−0.60 < ρ ≤ −0.40         Moderate Negative Correlation
−0.40 < ρ ≤ −0.20         Weak Negative Correlation
−0.20 < ρ < 0             Very Weak Negative Correlation
Scenario 1
You want to identify the level of linear relationship between the age
and IQ.
Scenario 2
You want to check if there is a significant linear relationship between
the behavior and the academic performance of your students.
Pearson Coefficient of Correlation
Use
The Pearson Correlation Coefficient is a measure of the linear
dependence between two variables X and Y, giving a value between
+1 and −1 inclusive, where 1 is total positive linear correlation, 0 is
no linear correlation, and −1 is total negative linear correlation.
R Function
cor.test(object1, object2, method="pearson", alternative="two.sided")  # or "less", "greater"
Assumptions
- Your data should be measured at the interval or ratio level.
- Each participant or observation should have a pair of values.
- There should be no significant outliers.
- There is a linear relationship between your variables.
- Your variables should be approximately normally distributed.
- The variance of each group should be equal.
Spearman Rank Coefficient of Correlation
Use
The Spearman Rank Correlation Coefficient is a nonparametric
measure of rank correlation between two variables. It assesses how
well the relationship between two variables can be described using a
monotonic function.
R Function
cor.test(object1, object2, method="spearman", alternative="two.sided")  # or "less", "greater"
Definition
A monotonic relationship is a relationship that does one of the
following:
- as the value of one variable increases, so does the value of the
  other variable; or
- as the value of one variable increases, the other variable value
  decreases.
Assumptions
- Your data should be measured on at least an ordinal scale.
- Each participant or observation should have a pair of values.
START: Linear?
  No: Spearman Rank Correlation Coefficient
  Yes: Normal?
    No: Spearman Rank Correlation Coefficient
    Yes: Equal Variances?
      No: Spearman Rank Correlation Coefficient
      Yes: Pearson Correlation Coefficient
The popular ice cream franchise Coldstone Creamery posted the
nutritional information for its ice cream offerings in three serving sizes
- "Like it", "Love it", and "Gotta Have it" - on their website. A
portion of that information for the "Like it" serving is shown in the
table below.
Analyze the data and identify the level of linear relationship between
Calories and Total Fat (grams) and check if the linear relationship
is significant using the Pearson Correlation Coefficient at α = 0.05.
Assume that the data pass the assumptions of the Pearson Correlation
Coefficient.
Flavor Calories Total Fat (grams)
Cake Batter 340 19
Cinnamon Bun 370 21
French Toast 330 19
Mocha 320 20
OREO Creme 440 31
Peanut Butter 370 24
Strawberry Cheesecake 320 21
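The ice cream example can be sketched with cor.test. The object names calories and total.fat are illustrative, and a two-sided alternative is assumed because the question asks only whether the linear relationship is significant.

```r
# "Like it" serving data, in the order listed in the table above
calories = c(340, 370, 330, 320, 440, 370, 320)
total.fat = c(19, 21, 19, 20, 31, 24, 21)

# H0: rho = 0 (no linear relationship); Ha: rho is not equal to 0
result = cor.test(calories, total.fat,
                  method = "pearson", alternative = "two.sided")
print(result)

result$estimate          # sample correlation coefficient
result$p.value <= 0.05   # TRUE means the relationship is significant
```

The estimate is read against the interpretation table earlier in this chapter, and the p-value is compared with α to judge significance.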
Math is simply beautiful, almost perfect and it takes time
and creativity to uncover its mask.
-Anonymous