"Significance Testing 101: The Z Test - Part One": My Former Statistics Professor Used To Say That: "Our World Is Noisy."
"Significance Testing 101: The Z Test - Part One": My Former Statistics Professor Used To Say That: "Our World Is Noisy."
What is Significance testing?
The mean
Before we dive into the test itself, there are a few key concepts we need to get familiar with.
The first and most commonly known of them is the mean. The mean is a statistical tool that
helps us understand the distribution of data. Simply put, the mean is just the mathematical
term for average: the sum of all the data points in the distribution divided by the
number of data points being averaged:
$$\mu = \frac{X_1 + X_2 + \dots + X_n}{n}$$
Where $X_1, X_2, \dots, X_n$ represent all the numbers being averaged and $n$ is how many of
them there are.
The average gives us a very useful and intuitive way to understand the distribution without
having to go through all the observations in it. For example, if you know that the average
test score of a class is 98.5 out of 100, then you do not have to read all the test scores of all
the students to know that most of them did really well.
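As a minimal sketch of the calculation, here is the mean computed by hand in Python. The
scores are made-up numbers chosen for illustration, and the helper name `mean` is
hypothetical rather than something from the original post.

```python
# Hypothetical test scores for a small class (made-up numbers for illustration).
scores = [97, 100, 99, 98, 98.5, 98.5]


def mean(values):
    """Return the sum of all values divided by how many values there are."""
    return sum(values) / len(values)


class_mean = mean(scores)
print(class_mean)
```

Python's standard library offers the same calculation as `statistics.mean`, so the
hand-written helper is only there to mirror the formula.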
The standard deviation
Imagine that I am offering you one million dollars for answering this one question: "Is 300
above the average a lot?". Honestly, my money is safe, because this is a trick question. In
order to answer it, more information is required. The standard deviation tells us how
spread out the observations are around the average. If the standard deviation is small (for
example, if the biggest deviation from the average is 3), then 300 is a lot. The formula for
standard deviation is:
$$\sigma = \sqrt{\frac{\sum (X - \mu)^2}{n}}$$
Where $X$ is a single observation, $\mu$ is the mean, $n$ is the number of observations and
$\sum$ means "sum up all of the following". The basic idea of this equation is to sum up the
squared distances of the observations from the mean (they are squared so that negative
distances do not cancel positive ones), and to divide that sum by $n$ so that the standard
deviation is not affected by the number of observations, only by their distance from the
average. The standard deviation is the square root of that number.
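The steps described above can be sketched in Python as follows. This is an illustrative
version that divides by n, as the text describes; the function name `std_dev` and both data
sets are assumptions made up for the example.

```python
import math


def std_dev(values):
    """Standard deviation that divides by n, following the steps in the text."""
    mu = sum(values) / len(values)                       # the mean
    squared_distances = [(x - mu) ** 2 for x in values]  # squaring removes the sign
    return math.sqrt(sum(squared_distances) / len(values))


# Two made-up data sets with the same mean (300) but very different spread.
tight = [297, 300, 303]
spread_out = [0, 300, 600]

print(std_dev(tight))       # small: being 300 above the average would be a lot here
print(std_dev(spread_out))  # large: being 300 above the average is unremarkable here
```

The standard library function `statistics.pstdev` performs the same divide-by-n
calculation, while `statistics.stdev` divides by n - 1 instead.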
The normal distribution
One of the most important distributions in statistics is the normal distribution. Here are
some examples of it:
As you can see, it is symmetrical and has only one peak. The center of the distribution is its
mean. Although there are other distributions with these properties, the normal distribution
has a very important feature that we can use: given a number of standard deviations from
its center (its mean), we know the area under the curve up to that point. The following
graph and example will make this idea clearer:
[Figure: a normal curve whose horizontal axis is marked, from left to right, at three, two
and one standard deviations below the mean, at the mean, and at one, two and three standard
deviations above it.]
For example, if the mean of a normal distribution is 30, and its standard deviation is 3.5,
then 34.1% of all observations will be between 30 and 33.5.
Even though the form of the normal distribution varies according to the mean and standard
deviation, it will always be bell-shaped and those percentages will always stay the same.
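The 34.1% figure can be checked numerically. The sketch below uses `NormalDist` from
Python's standard `statistics` module (available since Python 3.8); the second distribution,
with mean 100 and standard deviation 15, is an arbitrary example added only to show that the
percentage does not depend on the particular mean or standard deviation.

```python
from statistics import NormalDist

# The distribution from the example: mean 30, standard deviation 3.5.
dist = NormalDist(mu=30, sigma=3.5)

# Area under the curve between the mean and one standard deviation above it.
area_one_sd = dist.cdf(33.5) - dist.cdf(30)
print(round(area_one_sd, 3))  # roughly 0.341

# An arbitrary second normal distribution gives the same share.
other = NormalDist(mu=100, sigma=15)
area_other = other.cdf(115) - other.cdf(100)
print(round(area_other, 3))
```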
It is important to understand that the area under the curve over a range of values represents
the probability of randomly selecting a value in that range. Values in the middle of a normal
distribution are more likely to be selected, which is why more of the area is concentrated in
the middle. On the other hand, values that are more than 2 standard deviations from the
average are at the edges and will rarely be selected.
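To put rough numbers on "likely" and "rarely", the sketch below measures two areas on a
standard normal curve using the standard library's `NormalDist`. The variable names are
illustrative and not from the original post.

```python
from statistics import NormalDist

curve = NormalDist()  # mean 0, standard deviation 1

# Probability of landing within one standard deviation of the mean.
within_one_sd = curve.cdf(1) - curve.cdf(-1)

# Probability of landing more than two standard deviations away, on either side.
beyond_two_sd = 2 * (1 - curve.cdf(2))

print(round(within_one_sd, 3))  # roughly 0.683
print(round(beyond_two_sd, 3))  # roughly 0.046
```

Because the curve is symmetrical, the area of one tail is simply doubled to cover both
edges.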
The standard error of the mean is calculated as:
$$SE = \frac{\sigma}{\sqrt{n}}$$
Where $SE$ is the standard error, $\sigma$ is the standard deviation and $n$ is the number of
observations in one sample.
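As a small illustration, the standard error of the mean is the standard deviation divided by
the square root of the sample size. The function name `standard_error` and the numbers below
are assumptions chosen for the example.

```python
import math


def standard_error(sigma, n):
    """Standard error of the mean: standard deviation over the square root of n."""
    return sigma / math.sqrt(n)


# Made-up example: standard deviation 3.5, samples of 49 and 196 observations.
se_small_sample = standard_error(3.5, 49)
se_large_sample = standard_error(3.5, 196)

print(se_small_sample)  # 0.5
print(se_large_sample)  # 0.25
```

Quadrupling the sample size halves the standard error, so the means of larger samples
scatter less around the true mean.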
The Z score
Before we can start learning to perform the Z test, there is one last (and simple) key idea we
need to understand: the Z score. A Z score given to one observation placed inside the
distribution of single observations expresses the distance of that observation from the mean
in units of standard deviation. More commonly, it is used to express the distance of the mean
of one sample from the center of the sample distribution, in units of standard error. The
formula for generating a Z score is:
$$Z = \frac{X - \mu}{SE}$$
Where $Z$ is the Z score, $X$ is the observation's value, $\mu$ is the mean of all
observations and $SE$ is the standard error (for a single observation, the standard deviation
takes its place).
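Here is a minimal sketch of both uses of the Z score: one for a single observation, measured
in standard deviations, and one for a sample mean, measured in standard errors. The function
name `z_score` and all the numbers are hypothetical.

```python
import math


def z_score(value, mean, spread):
    """Distance of value from mean, expressed in units of spread."""
    return (value - mean) / spread


# A single observation of 37 in a distribution with mean 30 and standard deviation 3.5.
z_single = z_score(37, 30, 3.5)

# A sample mean of 31 from 49 observations; the spread is now the standard error.
se = 3.5 / math.sqrt(49)
z_sample_mean = z_score(31, 30, se)

print(z_single)       # 2.0
print(z_sample_mean)  # 2.0
```

Both values land two units from the mean, even though the sample mean is only 1 away from
30, because sample means vary much less than single observations do.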
Let's get familiar with a specific kind of normal distribution: the Standard normal
distribution. In order to construct it, we can take each of the observations and give it a Z
score. For the Z test, this needs to be done with the means of the samples in the sample
distribution. With all of those Z scores, we can create the Standard normal distribution. This
distribution has a few interesting features: its mean is always 0, and its standard
deviation is always 1. Also, the Standard normal distribution created from any sample
distribution is identical, regardless of the original mean or SD. The Standard
normal distribution is defined as a normal distribution with mean = 0 and standard
deviation = 1.
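The claim that Z scores always have mean 0 and standard deviation 1 can be checked on any
data set. The sketch below uses made-up single observations rather than sample means, purely
to keep the example short; the same arithmetic applies to sample means.

```python
import statistics

# Made-up observations; any numbers with some spread would work.
data = [12.0, 15.0, 9.0, 20.0, 14.0]

mu = statistics.fmean(data)      # mean of the original data
sigma = statistics.pstdev(data)  # standard deviation, dividing by n

# Give every observation a Z score.
z_scores = [(x - mu) / sigma for x in data]

z_mean = statistics.fmean(z_scores)
z_sd = statistics.pstdev(z_scores)
print(round(z_mean, 6), round(z_sd, 6))
```

Standardizing only shifts and rescales the values. It does not change the shape of the
distribution, so the result is a Standard normal distribution only when the original data
are normally distributed.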
The Standard normal distribution helps us understand the uniqueness of a certain value.
Consider this question: Is it less likely to randomly run into a person (let's call him
David) on the street whose height is 2 meters, or into a person (let's call him Adam) whose
weight is 200 pounds? This is something that is hard to judge without some kind of
standardized scoring system. But if I were to tell you that the Z score of David's height is 2
and the Z score of Adam's weight is 1, it would mean that there are far fewer people taller
than David than people weighing more than Adam.
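[Figure: standard normal curve with the horizontal axis marked from -1 to 3; Adam is placed
at Z = 1 and David at Z = 2.]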
As you can see, about 15.9% of the population weighs more than Adam, while only about 2.3% is
taller than David.
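The two shares can be computed directly from the Z scores as the area of the Standard normal
curve to the right of each one. This sketch uses the standard library's `NormalDist` and
assumes, as the example does, that both height and weight are normally distributed; the
variable names are illustrative.

```python
from statistics import NormalDist

curve = NormalDist()  # Standard normal distribution: mean 0, standard deviation 1

# Share of the population to the right of each Z score.
share_heavier_than_adam = 1 - curve.cdf(1)  # Adam's weight has a Z score of 1
share_taller_than_david = 1 - curve.cdf(2)  # David's height has a Z score of 2

print(round(share_heavier_than_adam, 4))  # roughly 0.1587
print(round(share_taller_than_david, 4))  # roughly 0.0228
```

Percentages quoted from a chart may differ from these values in the last digit, since chart
labels are usually rounded segment by segment.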
What's next
In this post we have laid the necessary foundations to understand the Z test and many other
kinds of significance tests. In part 2, we will learn how to perform the Z test. We will also
discuss when it is right to use it and what its limitations are. And finally, we will learn to use
it with Python.