Desc Excel
Descriptive statistics summarize and organize characteristics of a data set. A data set is a
collection of responses or observations from a sample or entire population.
In quantitative research, after collecting data, the first step of statistical analysis is to describe
characteristics of the responses, such as the average of one variable (e.g., age), or the relation
between two variables (e.g., age and creativity).
The next step is inferential statistics, which help you decide whether your data confirms or
refutes your hypothesis and whether it is generalizable to a larger population.
You can apply these to assess only one variable at a time, in univariate analysis, or to
compare two or more, in bivariate and multivariate analysis.
Research example: You want to study the popularity of different leisure activities by gender.
You distribute a survey and ask participants how many times they did each of the following
in the past year:
Go to a library
Watch a movie at a theater
Visit a national park
Your data set is the collection of responses to the survey. Now you can use descriptive
statistics to find out the overall frequency of each activity (distribution), the averages for each
activity (central tendency), and the spread of responses for each activity (variability).
Frequency distribution
A data set is made up of a distribution of values, or scores. In tables or graphs, you can
summarize the frequency of every possible value of a variable in numbers or percentages.
This is called a frequency distribution.
Gender Frequency
Male 182
Female 235
Other 27
From this table, you can see that more women than men or people with another gender
identity took part in the study.
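A frequency distribution like this one can be tallied and converted to percentages in a few lines. The following is a minimal Python sketch, not part of the original study: the function name `frequency_table` is an illustrative assumption, and the raw responses are rebuilt from the counts in the table.

```python
from collections import Counter


def frequency_table(responses):
    """Return {value: (count, percentage)} for a list of responses."""
    counts = Counter(responses)
    total = sum(counts.values())
    return {
        value: (count, round(100 * count / total, 1))
        for value, count in counts.items()
    }


# Rebuild the raw responses from the counts reported in the table.
responses = ["Male"] * 182 + ["Female"] * 235 + ["Other"] * 27
table = frequency_table(responses)

for gender, (count, percent) in table.items():
    print(f"{gender}: {count} ({percent}%)")
```

Reporting percentages alongside counts makes it easier to compare groups when sample sizes differ between studies.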
Here we will demonstrate how to calculate the mean, median, and mode using the first 6
responses of our survey.
The mean, or M, is the most commonly used method for finding the average.
To find the mean, simply add up all response values and divide the sum by the total number
of responses. The total number of responses or observations is called N.
Measures of central tendency help you find the middle, or the average, of a dataset. The 3
most common measures of central tendency are the mode, median, and mean.
In addition to central tendency, the variability and distribution of your dataset is important to
understand when performing descriptive statistics.
Normal distribution
In a normal distribution, data is symmetrically distributed with no skew. Most values cluster
around a central region, with values tapering off as they go further away from the center. The
mean, mode and median are exactly the same in a normal distribution.
Example: Normal distribution
You survey a sample in your local community on the number
of books they read in the last year.
A histogram of your data shows the frequency of responses for each possible number of
books. From looking at the chart, you see that there is a normal distribution.
The mean, median and mode are all equal; the central tendency of this dataset is 8.
Skewed distributions
In skewed distributions, more values fall on one side of the center than the other, and the
mean, median and mode all differ from each other. One side has a more spread out and longer
tail with fewer scores at one end than the other. The direction of this tail tells you the side of
the skew.
In a positively skewed distribution, there’s a cluster of lower scores and a spread out tail on
the right. In a negatively skewed distribution, there’s a cluster of higher scores and a spread
out tail on the left.
Mean
The mean (aka the arithmetic mean, different from the geometric mean) of a dataset is the
sum of all values divided by the total number of values. It’s the most commonly used
measure of central tendency and is often referred to as the “average.”
The formulas for the sample mean and the population mean only differ in mathematical
notation. Population attributes use capital letters while sample attributes use lowercase letters.
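Population mean
$\mu = \frac{\sum X}{N}$
μ = population mean
ΣX = sum of each value in the population
N = number of values in the population
Sample mean
$\bar{x} = \frac{\sum x}{n}$
x̄ = sample mean
Σx = sum of each value in the sample
n = number of values in the sample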
Let’s say you want to find the average amount people spend on a restaurant meal in your
neighborhood. You ask a sample of 8 neighbors how much they spent the last time they went
out for dinner, and find the mean cost.
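Data set
Cost of dinner for two (USD) 42 13 31 87 24 58 76 69
Step 1: Find the sum of the values by adding them all up
42 + 13 + 31 + 87 + 24 + 58 + 76 + 69 = 400
Step 2: Divide the sum by the number of values
$n = 8$, $\sum x = 400$
$\bar{x} = \frac{\sum x}{n} = \frac{400}{8} = 50$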
The mean tells us that in our sample, participants spent an average of 50 USD on their
restaurant bill.
Let’s see what happens to the mean when we add an outlier to our data set.
Data set
Cost of dinner for two (USD) 42 13 31 87 24 58 76 69 230
Step 1: Find the sum of the values by adding them all up
42 + 13 + 31 + 87 + 24 + 58 + 76 + 69 + 230 = 630
Step 2: Divide the sum by the number of values
$n = 9$, $\sum x = 630$
$\bar{x} = \frac{\sum x}{n} = \frac{630}{9} = 70$
As we can see, adding just one outlier to our data set raised the mean by 20 USD. In this case,
a different measure of central tendency, like the median, would be more appropriate.
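The effect of the outlier on the mean and the median can be checked directly. This is a small sketch using Python's standard `statistics` module with the restaurant costs from the example; the variable names are illustrative.

```python
from statistics import mean, median

costs = [42, 13, 31, 87, 24, 58, 76, 69]   # USD, from the example
costs_with_outlier = costs + [230]          # same data plus the outlier

print("mean:  ", mean(costs), "->", mean(costs_with_outlier))
print("median:", median(costs), "->", median(costs_with_outlier))
```

The mean moves from 50 to 70, while the median only moves from 50 to 58, which is why the median is the more robust choice here.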
Type of variable
The mean can only be calculated for quantitative variables (e.g., height), and it can’t be found
for categorical variables (e.g., gender).
In categorical variables, data is placed into groupings without exact numerical values, so the
mean cannot be calculated. For categorical variables, the mode is the best measure of central
tendency because it tells you the most common characteristic or popular choice for your
sample.
But for continuous or discrete variables, you have exact numerical values. With these, you
can easily calculate the mean or median.
Distribution shape
The mean is best for data sets with normal distributions. In a normal distribution, data is
symmetrically distributed with no skew. Most values cluster around a central region, with
values tapering off as they go further away from the center.
The mean, mode and median are exactly the same in a normal distribution.
In skewed distributions, more values fall on one side of the center than the other, and the
mean, median and mode all differ from each other. One side has a more spread out and longer
tail with fewer scores at one end than the other.
For skewed distributions and distributions with outliers, the mean is easily influenced by
extreme values and may not accurately represent the central tendency. The median is a better
measure for these distributions as it takes a value from the middle of the distribution.
Alternatively, you can systematically review and remove outliers from your dataset in
the data cleansing process.
Mode
The mode is the most frequently occurring value in the dataset. It’s possible to have no mode,
one mode, or more than one mode.
To find the mode, sort your dataset numerically or categorically and select the response that
occurs most frequently.
Example: Finding the mode
In a survey, you ask 9 participants whether they identify as
conservative, moderate, or liberal.
To find the mode, sort your data by category and find which response was chosen most
frequently.
To make it easier, you can create a frequency table to count up the values for each category.
Conservative 2
Moderate 3
Liberal 4
Mode: Liberal
The mode is easily seen in a bar graph because it is the value with the highest bar.
When to use the mode
The mode is most applicable to data from a nominal level of measurement. Nominal data is
classified into mutually exclusive categories, so the mode tells you the most popular
category.
For continuous variables or ratio levels of measurement, the mode may not be a helpful
measure of central tendency. That’s because there are many more possible values than there
are in a nominal or ordinal level of measurement. It’s unlikely for a value to repeat in a
ratio level of measurement.
Example: Ratio data with no mode
You collect data on reaction times in a computer task, and
your dataset contains values that are all different from each other.
Participant 1 2 3 4 5 6 7 8 9
Reaction time (milliseconds) 267 345 421 324 401 312 382 298 303
In this dataset, there is no mode, because each value occurs only once.
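Finding the mode amounts to counting how often each value occurs. The sketch below is one possible implementation, assuming the convention used in this text that a dataset in which nothing repeats has no mode; the helper name `find_modes` is hypothetical.

```python
from collections import Counter


def find_modes(values):
    """Return every value tied for the highest count, or [] if nothing repeats."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:
        return []  # convention assumed here: no repeated value means no mode
    return [value for value, count in counts.items() if count == highest]


political = ["Conservative"] * 2 + ["Moderate"] * 3 + ["Liberal"] * 4
reaction_ms = [267, 345, 421, 324, 401, 312, 382, 298, 303]

print(find_modes(political))
print(find_modes(reaction_ms))
```

Note that the standard library's `statistics.multimode` follows a different convention: when no value repeats, it returns every value rather than an empty list.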
Median
The median of a dataset is the value that’s exactly in the middle when it is ordered from low
to high.
Example: Finding the median
You measure the reaction times of 7 participants on a computer
task and categorize them into 3 groups: slow, medium or fast.
Participant 1 2 3 4 5 6 7
To find the median, you first order all values from low to high. Then, you find the value in
the middle of the ordered dataset—in this case, the value in the 4th position.
Median: Medium
In larger datasets, it’s easier to use simple formulas to figure out the position of the middle
value in the distribution. You use different methods to find the median of a dataset depending
on whether the total number of values is even or odd.
For an odd-numbered dataset, find the value that lies at the $\frac{n+1}{2}$ position, where n is the
number of values in the dataset.
Example: You measure the reaction times in milliseconds of 5 participants and order the
dataset.
Reaction time (milliseconds) 287 298 345 365 380
The middle position is $\frac{5+1}{2} = 3$. That means the median is the 3rd value in your ordered dataset.
Median: 345 milliseconds
For an even-numbered dataset, find the two values in the middle, at the $\frac{n}{2}$ and $\frac{n}{2} + 1$
positions.
Example: You measure the reaction times of 6 participants and order the dataset.
Reaction time (milliseconds) 287 298 345 357 365 380
The middle positions are $\frac{6}{2} = 3$ and $\frac{6}{2} + 1 = 4$. That means the middle values are the 3rd
value, which is 345, and the 4th value, which is 357.
To get the median, take the mean of the 2 middle values by adding them together and
dividing by 2: $\frac{345 + 357}{2} = 351$.
Median: 351 milliseconds
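The position rules for odd and even datasets translate directly into code. The following is a minimal sketch, assuming the usual rules (the value at position (n + 1) / 2 for odd n, and the mean of the values at positions n / 2 and n / 2 + 1 for even n); the function name is illustrative.

```python
def median_by_position(values):
    """Median via the middle-position rules for odd and even n."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty dataset is undefined")
    if n % 2 == 1:
        position = (n + 1) // 2            # 1-based middle position
        return ordered[position - 1]
    lower, upper = n // 2, n // 2 + 1      # 1-based middle positions
    return (ordered[lower - 1] + ordered[upper - 1]) / 2


print(median_by_position([287, 298, 345, 365, 380]))
print(median_by_position([287, 298, 345, 357, 365, 380]))
```

Positions in the text are counted from 1, while Python lists are indexed from 0, hence the `- 1` when reading values out of the ordered list.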
Mean
The arithmetic mean of a dataset (which is different from the geometric mean) is the sum of
all values divided by the total number of values. It’s the most commonly used measure of
central tendency because all values are used in the calculation.
Example: Mean with an outlier
In this dataset, we swap out one value with an extreme outlier.
Participant 1 2 3 4 5
While data from a sample can help you make estimates about a population, only full
population data can give you the complete picture.
In statistics, the notation of a sample mean and a population mean and their formulas are
different. But the procedures for calculating the population and sample means are the same.
Sample mean formula
$\bar{x} = \frac{\sum x}{n}$
x̄: sample mean
Σx: sum of all values in the sample dataset
n: number of values in the sample dataset
Population mean formula
The population mean is written as μ (the Greek letter mu). For
calculating the mean of a population, use this formula:
$\mu = \frac{\sum X}{N}$
μ: population mean
ΣX: sum of all values in the population dataset
N: number of values in the population dataset
The mode can be used for any level of measurement, but it’s most meaningful for
nominal and ordinal levels.
The median can only be used on data that can be ordered – that is, from ordinal,
interval and ratio levels of measurement.
The mean can only be used on interval and ratio levels of measurement because it
requires equal spacing between adjacent values or scores in the scale.
To decide which measures of central tendency to use, you should also consider the
distribution of your dataset.
For normally distributed data, all three measures of central tendency will give you the same
answer so they can all be used.
Mode
The mode or modal value of a data set is the most frequently occurring value. It’s a measure
of central tendency that tells you the most popular choice or most common characteristic of
your sample.
When reporting descriptive statistics, measures of central tendency help you find the middle
or the average of your data set. The three most common measures of central tendency are the
mode, median, and mean.
1. If the data for your variable takes the form of numerical values, order the values from
low to high. If it takes the form of categories or groupings, sort the values by group, in
any order.
2. Identify the value or values that occur most frequently.
Data set
Participant A B C D E F
Age 19 22 20 21 22 23
By ordering the values from low to high, we can easily see the value that occurs most
frequently.
Data set
Participant A B C D E F
To sort the values by group, you create a simple frequency table. Place the categories on the
left hand side and the frequencies on the right hand side.
Frequency table
Parents’ education level Frequency
Bachelor’s degree 2
Master’s degree 2
High school diploma 1
Doctoral degree 1
From the table, you can see that there are two modes. This means you have a bimodal data
set.
A grouped frequency table organizes large numerical data sets into intervals or classes of
values and reports the frequency of values in each class.
For grouped data, you can report the mode in two ways:
the modal class is the grouping with the highest frequency of values.
the modal value is estimated as the midpoint of the modal class.
The mode is only an estimate in this case, because the actual values within the modal class
are unknown.
Reaction times are placed in classes of 100 milliseconds each. The frequency column shows
the number of participants within each class.
Reaction time (milliseconds) Frequency
200–299 6
300–399 13
400–499 17
500–599 25
600–699 21
700–799 12
800–899 4
You can visualize your data set by plotting your data on a histogram. The mode is the value
with the highest peak on a histogram or bar chart.
From your table or histogram, you can see that the modal class – the group in which values
appear most frequently – is 500–599 milliseconds. Therefore, the mode is estimated to be at
the midpoint of this class: 550 milliseconds.
Importantly, the choice of intervals in grouped data can have a large impact on the mode. For
example, changing the intervals from 100 ms long to 50 or 200 ms long could result in
completely different modes.
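The grouped-data estimate and its sensitivity to the choice of intervals can be sketched as follows. This example assumes equal-width classes treated as half-open intervals (so 500–599 is read as 500 up to, but not including, 600, giving a midpoint of 550); the function name and the re-binned table are illustrative, obtained by merging adjacent classes from the table above.

```python
def grouped_mode(class_counts, width):
    """Return (lower bound of the modal class, estimated mode).

    class_counts maps each class's lower bound to its frequency.
    The mode is estimated as the midpoint of the modal class.
    """
    lower = max(class_counts, key=class_counts.get)
    return lower, lower + width / 2


# Frequencies from the table, keyed by the lower bound of each 100 ms class.
classes_100ms = {200: 6, 300: 13, 400: 17, 500: 25, 600: 21, 700: 12, 800: 4}

# The same data re-binned into 200 ms classes by merging adjacent classes.
classes_200ms = {200: 6 + 13, 400: 17 + 25, 600: 21 + 12, 800: 4}

print(grouped_mode(classes_100ms, width=100))
print(grouped_mode(classes_200ms, width=200))
```

With 100 ms classes the estimate is 550 ms, but with 200 ms classes the modal class becomes 400–599 and the estimate drops to 500 ms, even though the underlying data has not changed.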
The mode works best with categorical data. It is the only measure of central tendency
for nominal variables, where it can reflect the most commonly found characteristic (e.g.,
demographic information). The mode is also useful with ordinal variables – for example, to
reflect the most popular answer on a ranked scale (e.g., level of agreement).
For quantitative data, such as reaction time or height, the mode may not be a helpful measure
of central tendency. That’s because there are often many more possible values for
quantitative data than there are for categorical data, so it’s unlikely for values to repeat.
In this data set, there is no mode, because each value occurs only once.
Median
The median is the value that’s exactly in the middle of a dataset when it is ordered. It’s a
measure of central tendency that separates the lowest 50% from the highest 50% of values.
The steps for finding the median differ depending on whether you have an odd or an even
number of data points. If there are two numbers in the middle of a dataset, their mean is the
median.
The median is usually used with quantitative data (where the values are numerical), but you
can sometimes also find the median for an ordinal dataset (where the values are ranked
categories).
Dataset
Weekly pay (USD) 350 800 220 500 130
Let’s add another value to the dataset. Now you have 6 values.
Dataset
Weekly pay (USD) 350 800 220 500 130 1150
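The middle positions are found using the formulas $\frac{n}{2}$ and $\frac{n}{2} + 1$, where n is the number of
values in your dataset.
Formula: $\frac{n}{2}$. Calculation: $\frac{6}{2} = 3$
Formula: $\frac{n}{2} + 1$. Calculation: $\frac{6}{2} + 1 = 4$
The middle values are therefore the 3rd and 4th values of the ordered dataset
(130, 220, 350, 500, 800, 1150), which are 350 and 500, so the median is
$\frac{350 + 500}{2} = 425$ USD.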
Ordinal data is organized into categories with a rank order – for example language ability
level (beginner, intermediate, or fluent) or level of agreement (strongly agree, agree, etc.).
Odd-numbered dataset
We’ll walk through the steps for an odd-numbered ordinal dataset with 7 values.
You categorize reaction times of participants into 3 groups: slow, medium or fast.
First, order all values in ascending order.
Ordered dataset
Reaction speed Slow Slow Medium Medium Fast Fast Fast
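Next, find the middle value using $\frac{n+1}{2}$, where n is the number of values in the dataset.
With n = 7, the middle position is $\frac{7+1}{2} = 4$, so the median is the 4th value: Medium.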
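For ordinal data, the same middle-position idea works as long as the categories are sorted by rank rather than alphabetically. Below is a minimal sketch for an odd-numbered dataset; the rank mapping, the function name, and the unordered input list are assumptions made for illustration (only the ordered dataset appears in the text).

```python
RANK = {"Slow": 0, "Medium": 1, "Fast": 2}   # assumed rank order, low to high


def ordinal_median(values, rank=RANK):
    """Median category of an odd-numbered ordinal dataset."""
    if len(values) % 2 == 0:
        raise ValueError("this sketch only handles an odd number of values")
    ordered = sorted(values, key=rank.__getitem__)
    position = (len(ordered) + 1) // 2        # 1-based middle position
    return ordered[position - 1]


# Hypothetical unordered responses that sort into the ordered dataset above.
speeds = ["Fast", "Slow", "Medium", "Fast", "Slow", "Fast", "Medium"]
print(ordinal_median(speeds))
```

The sketch deliberately refuses even-numbered datasets, because the two middle categories may differ and their mean is not defined for ordinal data.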
For example, if the two middle values are “slow” and “medium,” you can’t calculate the
mean of these values.
In practice, ordinal data is sometimes converted into a numerical format and treated like
quantitative data for the sake of convenience. Then the mean of the middle values can be
calculated to find the median.
While this is considered acceptable in some contexts, it is not always seen as correct.
In skewed distributions, more values fall on one side of the center than the other, and the
mean, median and mode all differ from each other.
In a positively skewed distribution, there’s a cluster of lower scores and a spread out tail on
the right.
In a negatively skewed distribution, there’s a cluster of higher scores and a spread out tail on
the left.
Because the median only uses one or two values from the middle of a dataset, it’s unaffected
by extreme outliers or non-symmetric distributions of scores. In contrast, the positions of the
mean and mode can vary in skewed distributions.
For this reason, the median is often reported as a measure of central tendency for variables
such as income, because these distributions are usually positively skewed.
The level of measurement of your variable also determines whether you can use the median.
The median can only be used on data that can be ordered – that is,
from ordinal, interval and ratio levels of measurement.
Geometric Mean
The geometric mean is an average that multiplies all values and finds a root of the
number. For a dataset with n numbers, you find the nth root of their product. You can
use this descriptive statistic to summarize your data.
The geometric mean is an alternative to the arithmetic mean, which is often referred
to simply as “the mean.” While the arithmetic mean is based on adding values, the
geometric mean multiplies values.
The geometric mean formula can be written in two ways, but they are equivalent
mathematically.
$\text{Geometric mean} = \sqrt[n]{\prod x_i}$
$\text{Geometric mean} = \left(\prod x_i\right)^{\frac{1}{n}}$
Π = product of …
$x_i$ = every value
n = total number of values
1/n = reciprocal of n
The symbol pi (Π) is similar to the summation sign sigma (Σ), but instead it tells you
to find the product of what follows after it by multiplying them all together.
In the first formula, the geometric mean is the nth root of the product of all values.
In the second formula, the geometric mean is the product of all values raised to the
power of the reciprocal of n.
These formulas are equivalent because of the laws of exponents: taking the nth root
of x is exactly the same as raising x to the power of 1/n.
We’ll walk you through some examples showing how to find the geometric means of
different types of data.
Step 1: Multiply all values together to get their product.
Step 2: Find the nth root of the product (n is the number of values).
The average voter turnout of the past five US elections was 54.64%.
Machine A 7 80 2100
Machine B 3 94 2350
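Geometric mean of Machine A
Step 1: Multiply all values together to get their product.
$7 \times 80 \times 2100 = 1{,}176{,}000$
Step 2: Find the nth root of the product (n is the number of values).
$\sqrt[3]{1{,}176{,}000} \approx 105.55$
Geometric mean of Machine B
Step 1: Multiply all values together to get their product.
$3 \times 94 \times 2350 = 662{,}700$
Step 2: Find the nth root of the product (n is the number of values).
$\sqrt[3]{662{,}700} \approx 87.18$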
While the arithmetic means show higher efficiency for Machine B, the geometric
means show that Machine A is more efficient.
The geometric mean is more accurate here because the arithmetic mean is skewed
towards values that are higher than most of your dataset.
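The two means for each machine can be computed side by side to compare the rankings they produce. This is a short sketch using the three ratings per machine from the table; the helper `geometric_mean` is written out here for clarity (Python 3.8+ also ships `statistics.geometric_mean`).

```python
from math import prod
from statistics import mean


def geometric_mean(values):
    """nth root of the product of n positive values."""
    return prod(values) ** (1 / len(values))


machines = {"Machine A": [7, 80, 2100], "Machine B": [3, 94, 2350]}

for name, ratings in machines.items():
    print(
        name,
        "arithmetic:", round(mean(ratings), 2),
        "geometric:", round(geometric_mean(ratings), 2),
    )
```

Because the three ratings sit on very different scales, the arithmetic mean is dominated by the largest one, whereas the geometric mean weighs relative differences on each scale equally.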
For example, say you study fruit fly population growth rates. You’re interested in
understanding how environmental factors change these rates.
You begin with 2 fruit flies, and every 12 days you measure the percentage increase
in the population.
Each percentage change value is also converted into a growth factor that is in
decimals. The growth factor includes the original value (100%), so to convert
percentage increase into a growth factor, add 100 to each percentage increase and
divide by 100.
Day 12 24 36
First, you convert percentage change into decimals. You add 100 to each value to
factor in the original amount, and divide each value by 100.
Arithmetic mean
To find the arithmetic mean, add up all values and divide this number by n.
Geometric mean
Step 1: Multiply all values together to get their product.
Step 2: Find the nth root of the product (n is the number of values).
The arithmetic mean population growth factor is 4.18, while the geometric mean
growth factor is 4.05.
Only the geometric mean gives us the true number of fruit flies in the final population.
It’s the most accurate mean for the growth factor.
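Why the geometric mean is the right average for growth factors can be seen by compounding it over all periods. The sketch below uses hypothetical percentage increases (they are not the values from the fruit fly study, which are not reproduced here); only the starting population of 2 comes from the text.

```python
from math import prod

start = 2                              # initial population, from the text
percent_increase = [200, 350, 500]     # hypothetical values, for illustration only

# Growth factor = (percentage increase + 100) / 100
factors = [(p + 100) / 100 for p in percent_increase]
n = len(factors)

final_population = start * prod(factors)

arithmetic = sum(factors) / n
geometric = prod(factors) ** (1 / n)

print("actual final population:     ", final_population)
print("compounding geometric mean:  ", start * geometric ** n)
print("compounding arithmetic mean: ", start * arithmetic ** n)
```

Applying the geometric mean growth factor in every period reproduces the actual final population (up to floating-point rounding), while applying the arithmetic mean overshoots it.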
While most values tend to be low, the arithmetic mean is often pulled upward (or
rightward) by high values or outliers in a positively skewed dataset.
Because the geometric mean tends to be lower than the arithmetic mean, it
represents smaller values better than the arithmetic mean.
Measures of variability
Measures of variability give you a sense of how spread out the response values are. The
range, standard deviation and variance each reflect different aspects of spread.
Variability describes how far apart data points lie from each other and from the
center of a distribution. Along with measures of central tendency, measures of
variability give you descriptive statistics that summarize your data.
While the central tendency, or average, tells you where most of your points lie,
variability summarizes how far apart they are. This is important because the amount
of variability determines how well you can generalize results from the sample to your
population.
Low variability is ideal because it means that you can better predict information
about the population based on sample data. High variability means that the values
are less consistent, so it’s harder to make predictions.
Data sets can have the same central tendency but different levels of variability
or vice versa. If you know only the central tendency or the variability, you can’t say
anything about the other aspect. Both of them together give you a complete picture
of your data.
Example: Variability in normal distributions
You are investigating the amounts of time spent
on phones daily by different groups of people.
Using simple random samples, you collect data from 3 groups:
All three of your samples have the same average phone use, at 195 minutes or 3 hours and 15
minutes. This is the x-axis value where the curves peak.
Although the data in each sample follows a normal distribution, the samples have different
spreads. Sample A has the largest variability while Sample C has the smallest variability.
Range
The range tells you the spread of your data from the lowest to the highest value in
the distribution. It's a commonly used measure of variability and the easiest one to
calculate.
The range is calculated by subtracting the lowest value from the highest value. While
a large range means high variability, a small range means low variability in a
distribution.
$R = H - L$
R = range
H = highest value
L = lowest value
To find the range, follow these steps:
1. Order all values in your data set from low to high.
2. Subtract the lowest value from the highest value.
Age 37 19 31 29 21 26 33 36
First, order the values from low to high to identify the lowest value (L) and
the highest value (H).
Age 19 21 26 29 31 33 36 37
R = H – L
R = 37 – 19 = 18
The range of our data set is 18 years.
But the range can be misleading when you have outliers in your data set. One
extreme value in the data will give you a completely different range.
Range example with an outlier
One value in your data set is replaced with an outlier.
Age 19 21 26 29 31 33 36 61
Using the same calculation, we get a very different result this time:
R = H – L
R = 61 – 19 = 42
Because only two numbers are used, the range is easily influenced by outliers. It
can’t tell you about the shape of the frequency distribution of values on its own.
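Because the range depends on only two values, the effect of a single outlier is easy to demonstrate. A minimal sketch using the two age data sets from the examples (the function name is illustrative):

```python
def value_range(values):
    """Range R = H - L: highest value minus lowest value."""
    return max(values) - min(values)


ages = [37, 19, 31, 29, 21, 26, 33, 36]
ages_with_outlier = [19, 21, 26, 29, 31, 33, 36, 61]

print(value_range(ages))
print(value_range(ages_with_outlier))
```

Replacing one value (37 with 61) more than doubles the range, from 18 to 42, even though the other seven values are unchanged.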
Note: To get a clear idea of your data’s variability, the range is best used in combination with
other measures of variability like interquartile range and standard deviation.
The highest value (H) is 324 and the lowest (L) is 72.
R = H – L
R = 324 – 72 = 252
Interquartile range
The interquartile range gives you the spread of the middle of your distribution.
For any distribution that’s ordered from low to high, the interquartile range contains
half of the values. While the first quartile (Q1) contains the first 25% of values, the
fourth quartile (Q4) contains the last 25% of values.
The interquartile range tells you the spread of the middle half of your distribution.
Quartiles segment any distribution that’s ordered from low to high into four equal
parts. The interquartile range (IQR) contains the second and third quartiles, or the
middle half of your data set. Whereas the range gives you the spread of the whole
data set, the interquartile range gives you the range of the middle half of a data set.
The interquartile range is the third quartile (Q3) minus the first quartile (Q1). This
gives us the range of the middle half of a data set.
Interquartile range example
To find the interquartile range of your 8 data points, you first
find the values at Q1 and Q3.
Multiply the number of values in the data set (8) by 0.25 for the 25th percentile (Q1) and by
0.75 for the 75th percentile (Q3).
Q1 position: 0.25 × 8 = 2
Q3 position: 0.75 × 8 = 6
Q1 is the value in the 2nd position, which is 110. Q3 is the value in the 6th position, which
is 287.
IQR = Q3 – Q1
IQR = 287 – 110 = 177
The IQR gives a consistent measure of variability for skewed as well as normal
distributions.
Calculate the interquartile range by hand
The interquartile range is found by subtracting the Q1 value from the Q3 value:
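$IQR = Q_3 - Q_1$
IQR = interquartile range
Q3 = 3rd quartile, or 75th percentile
Q1 = 1st quartile, or 25th percentile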
Q1 is the value below which 25 percent of the distribution lies, while Q3 is the value
below which 75 percent of the distribution lies.
You can think of Q1 as the median of the first half and Q3 as the median of the
second half of the distribution.
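There are several methods for finding the quartiles, and they can give slightly
different results. Here, we’ll discuss two of the most commonly used methods, the
exclusive and the inclusive method. These methods differ based on how they use the
median.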
The procedure for finding the median is different depending on whether your data set
is odd- or even-numbered.
When you have an odd number of data points, the median is the value in the
middle of your data set. You can choose between the inclusive and exclusive
method.
With an even number of data points, there are two values in the middle, so
the median is their mean. It’s more common to use the exclusive method in
this case.
While there is little consensus on the best method for finding the interquartile range,
the exclusive interquartile range is never smaller, and usually larger, than the
inclusive interquartile range.
The exclusive interquartile range may be more appropriate for large samples, while
for small samples, the inclusive interquartile range may be more representative
because it’s a narrower range.
Step 2: Locate the median, and then separate the values below it from the values above it.
With an even-numbered data set, the median is the mean of the two values in the middle, so you simply
divide your data set into two halves.
Q1 is the median of the first half and Q3 is the median of the second half. Since each of these halves has
an odd number of values, there is only one value in the middle of each half.
Step 2: Locate the median, and then separate the values below it from the values above it.
In an odd-numbered data set, the median is the number in the middle of the list. The median itself is
excluded from both halves: one half contains all values below the median, and the other contains all the
values above it.
The inclusive method is sometimes preferred for odd-numbered data sets because it
doesn’t ignore the median, a real value in this type of data set.
Step 2: Separate the list into two halves, and include the median in both halves.
The median is included as the highest value in the first half and the lowest value in the second half.
We can see from these examples that using the inclusive method gives us a smaller
IQR. With the same data set, the exclusive IQR is 24, and the inclusive IQR is 20.
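The exclusive and inclusive methods differ only in whether the median is kept in the two halves. The following sketch implements both using medians of halves; the data values are illustrative (they are not taken from the text) and were chosen so that the results match the figures of 24 and 20 quoted above. The function name `iqr` and its `method` parameter are assumptions of this example.

```python
from statistics import median


def iqr(values, method="exclusive"):
    """Interquartile range using the median-of-halves approach.

    For an odd number of values, 'exclusive' leaves the median out of both
    halves and 'inclusive' puts it in both. For an even number of values the
    data simply splits into two halves, so both methods agree.
    """
    if method not in ("exclusive", "inclusive"):
        raise ValueError("method must be 'exclusive' or 'inclusive'")
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    if n % 2 == 1 and method == "inclusive":
        lower, upper = ordered[: half + 1], ordered[half:]
    else:
        lower, upper = ordered[:half], ordered[n - half:]
    return median(upper) - median(lower)


data = [48, 52, 57, 61, 64, 72, 76, 77, 81, 85, 88]  # illustrative values
print(iqr(data, "exclusive"))
print(iqr(data, "inclusive"))
```

Note that library routines such as `statistics.quantiles` use interpolation-based definitions of "inclusive" and "exclusive" that can give different numbers from this median-of-halves approach.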
For skewed frequency distributions, the median is the best measure of central
tendency because it’s the value exactly in the middle when all values are ordered
from low to high.
Along with the median, the IQR can give you an overview of where most of your
values lie and how clustered they are.
The IQR is also useful for datasets with outliers. Because it’s based on the middle
half of the distribution, it’s less influenced by extreme values.
Lowest value
Q1: 25th percentile
Median
Q3: 75th percentile
Highest value (Q4)
The vertical lines in the box show Q1, the median, and Q3, while the whiskers at the
ends show the highest and lowest values.
In a boxplot, the width of the box shows you the interquartile range. A smaller width
means you have less dispersion, while a larger width means you have more
dispersion.
The placement of the box tells you the direction of the skew. A box that’s much
closer to the right side means you have a negatively skewed distribution, and a box
closer to the left side tells you that you have a positively skewed distribution.
Five-number summary
Every distribution can be organized using a five-number summary:
Lowest value
Q1: 25th percentile
Q2: the median
Q3: 75th percentile
Highest value (Q4)
These five-number summaries can be easily visualized using box and whisker plots.
Box and whisker plot example
For each of our samples, the horizontal lines in a box show Q1,
the median and Q3, while the whiskers at the end show the highest and lowest values.
Standard deviation
The standard deviation is the average amount of variability in your dataset.
It tells you, on average, how far each score lies from the mean. The larger the
standard deviation, the more variable the data set is.
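There are six steps for finding the standard deviation by hand:
1. List each score and find the mean.
2. Subtract the mean from each score to get the deviation from the mean.
3. Square each of these deviations.
4. Add up all of the squared deviations.
5. Divide the sum of the squared deviations by n – 1 (for a sample) or N (for a population).
6. Find the square root of the number you found.
Population standard deviation formula
$\sigma = \sqrt{\frac{\sum (X - \mu)^2}{N}}$
σ = population standard deviation
Σ = sum of…
X = each value
μ = population mean
N = number of values in the population
Sample standard deviation formula
$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$
s = sample standard deviation
Σ = sum of…
x = each value
x̄ = sample mean
n = number of values in the sample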
When you have population data, you can get an exact value for population standard
deviation. Since you collect data from every population member, the standard
deviation reflects the precise amount of variability in your distribution, the population.
But when you use sample data, your sample standard deviation is always used as
an estimate of the population standard deviation. Using n in this formula tends to
give you a biased estimate that consistently underestimates variability.
Reducing the sample n to n – 1 makes the standard deviation artificially large, giving
you a conservative estimate of variability.
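The by-hand procedure for the standard deviation can be written out step by step. This is a minimal sketch that reuses the ages from the range example earlier in the text; the function name and the `sample` flag are illustrative choices, with `sample=True` dividing by n – 1 and `sample=False` dividing by n.

```python
from math import sqrt


def standard_deviation(values, sample=True):
    """Standard deviation by hand; divides by n - 1 for a sample, n otherwise."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    mean = sum(values) / n                             # 1. find the mean
    deviations = [x - mean for x in values]            # 2. deviation of each score
    squared = [d ** 2 for d in deviations]             # 3. square each deviation
    sum_of_squares = sum(squared)                      # 4. add up the squares
    variance = sum_of_squares / (n - 1 if sample else n)   # 5. divide
    return sqrt(variance)                              # 6. take the square root


ages = [37, 19, 31, 29, 21, 26, 33, 36]   # ages from the range example
print(round(standard_deviation(ages), 2))                # sample SD
print(round(standard_deviation(ages, sample=False), 2))  # population SD
```

For the same data, dividing by n – 1 always gives a slightly larger result than dividing by n, which is the conservative adjustment described above.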
Variance
The variance is the average of squared deviations from the mean. A deviation from
the mean is how far a score lies from the mean.
Variance is the square of the standard deviation. This means that variance is
expressed in squared units (for example, minutes squared), so its values are often
much larger than a typical value of the data set.
While it’s harder to interpret the variance number intuitively, it’s important to calculate
variance for comparing different data sets in statistical tests like ANOVAs.
Variance reflects the degree of spread in the data set. The more spread the data, the
larger the variance is in relation to the mean.
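Population variance
$\sigma^2 = \frac{\sum (X - \mu)^2}{N}$
σ² = population variance
Σ = sum of…
X = each value
μ = population mean
N = number of values in the population
Sample variance
$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$
s² = sample variance
Σ = sum of…
x = each value
x̄ = sample mean
n = number of values in the sample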
Just like for standard deviation, there are different formulas for population and
sample variance. But while there is no unbiased estimate for standard deviation,
there is one for sample variance.
If the sample variance formula used the sample n, the sample variance would be
biased towards lower numbers than expected. Reducing the sample n to n – 1
makes the variance artificially larger.
In this case, bias is not only lowered but totally removed. The sample variance
formula gives completely unbiased estimates of variance.
The sample standard deviation, by contrast, remains slightly biased, because it
comes from finding the square root of sample variance. Since a square root isn’t a
linear operation, like addition or subtraction, the unbiasedness of the sample
variance formula isn’t carried over to the sample standard deviation formula.
For more complex interval and ratio levels, the standard deviation and variance are
also applicable.
Distribution
For normal distributions, all measures can be used. The standard deviation and
variance are preferred because they take your whole data set into account, but this
also means that they are easily influenced by outliers.
For skewed distributions or data sets with outliers, the interquartile range is the best
measure. It’s least affected by extreme values because it focuses on the spread in
the middle of the data set.
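To see why the interquartile range resists outliers, here is a minimal Python sketch. The scores are hypothetical, and the "inclusive" quartile method is only one of several common conventions, so other tools may report slightly different quartiles.

```python
import statistics

def iqr(values):
    # quantiles(n=4) returns the three quartile cut points Q1, Q2, Q3.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1

# Hypothetical scores; the second list swaps the largest value for an outlier.
scores = [2, 4, 4, 5, 6, 7, 8, 9]
with_outlier = [2, 4, 4, 5, 6, 7, 8, 90]

for label, data in (("no outlier", scores), ("with outlier", with_outlier)):
    print(label, "| IQR:", iqr(data), "| SD:", round(statistics.stdev(data), 2))
```

In this example the outlier leaves the interquartile range unchanged while the standard deviation grows sharply.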
Range
The range gives you an idea of how far apart the most extreme response scores are. To find
the range, simply subtract the lowest value from the highest value.
The range is a simple measure that tells you the spread of values in a data set. It
has a simple definition:
Range = maximum value – minimum value
So if you have a set of data such as 4, 2, 5, 8, 12, 15, the range is the highest
number (15) minus the lowest number (2). In this case:
Range = 15-2 = 13
This example tells you that the data set spans 13 numbers. In a box and whisker
plot, the ends of the whiskers give you a visual indication of the range, because
they mark the minimum and maximum values. A large range suggests a wide
spread of results, and a small range suggests data that is closely centered around a
specific value.
Range of visits to the library in the past year
Ordered data set: 0, 3, 3, 12, 15, 24
Range: 24 – 0 = 24
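The same calculation can be sketched in a few lines of Python. The helper name value_range is an illustrative choice, not part of any library.

```python
def value_range(values):
    # Range = maximum value - minimum value
    return max(values) - min(values)

library_visits = [0, 3, 3, 12, 15, 24]
print(value_range(library_visits))         # 24
print(value_range([4, 2, 5, 8, 12, 15]))   # 13
```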
Standard deviation
The standard deviation (s or SD) is the average amount of variability in your dataset. It tells
you, on average, how far each score lies from the mean. The larger the standard deviation, the
more variable the data set is.
A high standard deviation means that values are generally far from the mean, while a
low standard deviation indicates that values are clustered close to the mean.
Example: Comparing different standard deviations
You collect data on job satisfaction ratings
from three groups of employees using simple random sampling.
The mean (M) ratings are the same for each group – it’s the value on the x-axis when the
curve is at its peak. However, their standard deviations (SD) differ from each other.
The standard deviation reflects the dispersion of the distribution. The curve with the lowest
standard deviation has a high peak and a small spread, while the curve with the highest
standard deviation is more flat and widespread.
The empirical rule
The standard deviation and the mean together can tell you where most of the values
in your frequency distribution lie if they follow a normal distribution.
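The empirical rule, or the 68-95-99.7 rule, tells you where your values lie:
Around 68% of scores are within 1 standard deviation of the mean,
Around 95% of scores are within 2 standard deviations of the mean,
Around 99.7% of scores are within 3 standard deviations of the mean.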
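The percentages in the rule's name can be checked against a normal distribution. The sketch below uses Python's standard-library NormalDist; the helper name share_within and the default mean and standard deviation are illustrative assumptions.

```python
from statistics import NormalDist

def share_within(k, mean=0.0, sd=1.0):
    # Share of a normal distribution lying within k standard deviations of the mean.
    dist = NormalDist(mu=mean, sigma=sd)
    return dist.cdf(mean + k * sd) - dist.cdf(mean - k * sd)

for k in (1, 2, 3):
    print(f"within {k} SD: {share_within(k):.1%}")
```

The result does not depend on the mean or standard deviation you pass in, which is why the rule applies to any normally distributed variable.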
With samples, we use n – 1 in the formula because using n would give us a biased
estimate that consistently underestimates variability. The sample standard deviation
would tend to be lower than the real standard deviation of the population.
Reducing the sample n to n – 1 makes the standard deviation artificially large, giving
you a conservative estimate of variability.
There are six main steps for finding the standard deviation by hand. We’ll use a
small data set of 6 scores to walk through the steps.
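Data set
46 69 32 60 52 41
Step 1: Find the mean
To find the mean, add up all the scores, then divide them by the number of scores.
Mean (x̅)
x̅ = (46 + 69 + 32 + 60 + 52 + 41) ÷ 6 = 50
Step 2: Find each score’s deviation from the mean
Subtract the mean from each score to get the deviations from the mean.
Score Deviation from mean
46 46 – 50 = -4
69 69 – 50 = 19
32 32 – 50 = -18
60 60 – 50 = 10
52 52 – 50 = 2
41 41 – 50 = -9
Step 3: Square each deviation from the mean
Multiply each deviation from the mean by itself. This results in positive numbers.
(-4)² = -4 × -4 = 16
19² = 19 × 19 = 361
(-18)² = -18 × -18 = 324
10² = 10 × 10 = 100
2² = 2 × 2 = 4
(-9)² = -9 × -9 = 81
Step 4: Find the sum of squares
Sum of squares
16 + 361 + 324 + 100 + 4 + 81 = 886
Step 5: Find the variance
Divide the sum of squares by n – 1, because this is a sample.
Variance
886 ÷ (6 – 1) = 886 ÷ 5 = 177.2
Step 6: Find the square root of the variance
Standard deviation
√177.2 = 13.31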
From learning that SD = 13.31, we can say that each score deviates from the mean
by 13.31 points on average.
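The six steps can also be followed in code. This is a minimal sketch in plain Python, using the same six scores; the variable names are illustrative.

```python
import math

scores = [46, 69, 32, 60, 52, 41]

mean = sum(scores) / len(scores)                # Step 1: find the mean
deviations = [x - mean for x in scores]         # Step 2: deviation of each score
squared = [d ** 2 for d in deviations]          # Step 3: square each deviation
sum_of_squares = sum(squared)                   # Step 4: sum of squares
variance = sum_of_squares / (len(scores) - 1)   # Step 5: divide by n - 1 (sample)
sd = math.sqrt(variance)                        # Step 6: square root of the variance

print(mean, sum_of_squares, round(variance, 1), round(sd, 2))
```

Dividing by len(scores) instead of len(scores) - 1 in Step 5 would give the population formula.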
Why is standard deviation a useful measure of variability?
Although there are simpler ways to calculate variability, the standard deviation
formula weighs unevenly spread out samples more than evenly spread samples. A
higher standard deviation tells you that the distribution is not only more spread out,
but also more unevenly spread out.
This means it gives you a better idea of your data’s variability than simpler
measures, such as the mean absolute deviation (MAD).
The MAD is similar to standard deviation but easier to calculate. First, you express
each deviation from the mean in absolute values by converting them into positive
numbers (for example, -3 becomes 3). Then, you calculate the mean of these
absolute deviations.
Unlike the standard deviation, you don’t have to calculate squares or square roots of
numbers for the MAD. However, for that reason, it gives you a less precise measure
of variability.
Let’s take two samples with the same central tendency but different amounts of
variability. Sample B is more variable than Sample A.
For samples with equal average deviations from the mean, the MAD can’t
differentiate levels of spread. The standard deviation is more precise: it is higher for
the sample with more variability in deviations from the mean.
By squaring the differences from the mean, standard deviation reflects uneven
dispersion more accurately. This step weighs extreme deviations more heavily than
small deviations.
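A short Python sketch makes the contrast concrete. The two samples below are hypothetical values chosen so that both have a mean of 50 and the same MAD; the function name mean_absolute_deviation is an illustrative choice.

```python
import statistics

def mean_absolute_deviation(values):
    # Average of the absolute deviations from the mean.
    m = statistics.mean(values)
    return sum(abs(x - m) for x in values) / len(values)

# Hypothetical samples with the same mean (50) but different patterns of spread.
sample_a = [66, 30, 40, 64]
sample_b = [51, 21, 79, 49]

for name, sample in (("A", sample_a), ("B", sample_b)):
    print(name,
          "MAD:", mean_absolute_deviation(sample),
          "SD:", round(statistics.stdev(sample), 2))
```

Both samples have the same MAD, but the standard deviation is higher for Sample B, where two scores lie much further from the mean than the others.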
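Standard deviation of visits to the library in the past year
In the table below, you complete Steps 1 through 4.
Raw data   Deviation from mean   Squared deviation
15         15 – 9.5 = 5.5        30.25
3          3 – 9.5 = -6.5        42.25
12         12 – 9.5 = 2.5        6.25
0          0 – 9.5 = -9.5        90.25
24         24 – 9.5 = 14.5       210.25
3          3 – 9.5 = -6.5        42.25
Mean = 9.5   Sum = 0             Sum of squares = 421.5
Step 5: 421.5 ÷ 5 = 84.3
Step 6: √84.3 = 9.18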
From learning that s = 9.18, you can say that on average, each score deviates from the mean
by 9.18 points.
Variance
The variance is the average of squared deviations from the mean. Variance reflects the degree
of spread in the data set. The more spread the data, the larger the variance is in relation to the
mean.
To find the variance, simply square the standard deviation. The symbol for variance is s².
Variance of visits to the library in the past year
Data set: 15, 3, 12, 0, 24, 3
s = 9.18
s² = 84.3
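You can check these figures with Python's standard statistics module, whose stdev and variance functions use the sample (n – 1) formulas. This is a quick sketch on the library-visit data.

```python
import statistics

library_visits = [15, 3, 12, 0, 24, 3]

s = statistics.stdev(library_visits)       # sample standard deviation
s2 = statistics.variance(library_visits)   # sample variance

print(round(s, 2), round(s2, 1))
```

Squaring the unrounded standard deviation reproduces the variance. Squaring the rounded value 9.18 gives about 84.27, slightly less than 84.3.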
The standard deviation is derived from variance and tells you, on average, how far
each value lies from the mean. It’s the square root of variance.
Since the units of variance are much larger than those of a typical value of a data
set, it’s harder to interpret the variance number intuitively. That’s why standard
deviation is often preferred as a main measure of variability.
However, the variance is more informative about variability than the standard
deviation, and it’s used in making statistical inferences.
Population variance
When you have collected data from every member of the population that you’re
interested in, you can get an exact value for population variance.
Formula: σ² = Σ(X − μ)² ÷ N
Explanation:
σ² = population variance
Σ = sum of…
X = each value
μ = population mean
N = number of values in the population
Sample variance
When you collect data from a sample, the sample variance is used to make
estimates or inferences about the population variance.
Formula: s² = Σ(X − x̅)² ÷ (n − 1)
Explanation:
s² = sample variance
Σ = sum of…
X = each value
x̅ = sample mean
n = number of values in the sample
With samples, we use n – 1 in the formula because using n would give us a biased
estimate that consistently underestimates variability. The sample variance would
tend to be lower than the real variance of the population.
Reducing the sample n to n – 1 makes the variance artificially large, giving you an
unbiased estimate of variability: it is better to overestimate rather than underestimate
variability in samples.
It’s important to note that doing the same thing with the standard deviation formulas
doesn’t lead to completely unbiased estimates. Since a square root isn’t a linear
operation, like addition or subtraction, the unbiasedness of the sample variance
formula doesn’t carry over to the sample standard deviation formula.
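A small simulation can illustrate both points. The sketch below assumes a standard normal population (true variance and standard deviation of 1) and draws many samples of 5 scores; the function name, sample size, number of trials, and seed are all illustrative choices.

```python
import math
import random

def simulate(sample_size=5, trials=20000, seed=0):
    # Average three estimators over many samples from a population
    # whose true variance and standard deviation are both 1.
    rng = random.Random(seed)
    total_var_n = total_var_n1 = total_sd_n1 = 0.0
    for _ in range(trials):
        sample = [rng.gauss(0, 1) for _ in range(sample_size)]
        mean = sum(sample) / sample_size
        ss = sum((x - mean) ** 2 for x in sample)
        total_var_n += ss / sample_size                    # divides by n
        total_var_n1 += ss / (sample_size - 1)             # divides by n - 1
        total_sd_n1 += math.sqrt(ss / (sample_size - 1))   # root of the n - 1 variance
    return total_var_n / trials, total_var_n1 / trials, total_sd_n1 / trials

var_n, var_n1, sd_n1 = simulate()
print(round(var_n, 2), round(var_n1, 2), round(sd_n1, 2))
```

With these settings, the average variance computed with n falls well below 1, the average computed with n – 1 lands close to 1, and the average sample standard deviation still falls somewhat below 1.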
There are five main steps for finding the variance by hand. We’ll use a small data set
of 6 scores to walk through the steps.
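Data set
46 69 32 60 52 41
Step 1: Find the mean
To find the mean, add up all the scores, then divide them by the number of scores.
Mean (x̅)
x̅ = (46 + 69 + 32 + 60 + 52 + 41) ÷ 6
= 50
Step 2: Find each score’s deviation from the mean
Subtract the mean from each score to get the deviations from the mean.
Score Deviation from mean
46 46 – 50 = -4
69 69 – 50 = 19
32 32 – 50 = -18
60 60 – 50 = 10
52 52 – 50 = 2
41 41 – 50 = -9
Step 3: Square each deviation from the mean
Multiply each deviation from the mean by itself. This results in positive numbers.
(-4)² = -4 × -4 = 16
19² = 19 × 19 = 361
(-18)² = -18 × -18 = 324
10² = 10 × 10 = 100
2² = 2 × 2 = 4
(-9)² = -9 × -9 = 81
Step 4: Find the sum of squares
Sum of squares
16 + 361 + 324 + 100 + 4 + 81 = 886
Step 5: Divide the sum of squares by n – 1
Variance
886 ÷ (6 – 1) = 886 ÷ 5
= 177.2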
Uneven variances between samples result in biased and skewed test results. If you
have uneven variances across samples, non-parametric tests are more appropriate.
Research example
As an education researcher, you want to test the hypothesis that different
frequencies of quizzes lead to different final scores of college students. You collect the final
scores from three groups with 20 students each that had quizzes frequently, infrequently, or
rarely over a semester.
Your ANOVA assesses whether the differences in mean final scores
between groups come from the differences in the frequency of quizzes or the individual
differences of the students in each group.
To do so, you get a ratio of the between-group variance of final scores and the within-group
variance of final scores – this is the F-statistic. With a large F-statistic, you find the
corresponding p-value, and conclude that the groups are significantly different from each
other.
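To make the ratio concrete, here is a minimal Python sketch of the F-statistic for a one-way ANOVA. The scores are hypothetical and use 5 students per group instead of 20 to keep the example short; the function name f_statistic is illustrative. Finding the p-value requires the F distribution, which this sketch leaves out.

```python
def f_statistic(groups):
    # Ratio of between-group variance to within-group variance.
    all_scores = [x for g in groups for x in g]
    grand_mean = sum(all_scores) / len(all_scores)
    group_means = [sum(g) / len(g) for g in groups]

    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)

    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Hypothetical final scores for the three quiz-frequency groups.
frequent = [82, 85, 88, 90, 85]
infrequent = [78, 80, 83, 79, 80]
rarely = [70, 75, 72, 74, 69]

f = f_statistic([frequent, infrequent, rarely])
print(round(f, 2))
```

A large F value means the group means differ by much more than you would expect from the variability of scores within each group.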
Likewise, while the range is sensitive to outliers, you should also consider the standard
deviation and variance to get easily comparable measures of spread.
Multivariate analysis is the same as bivariate analysis but with more than two variables.
Contingency table
In a contingency table, each cell represents the intersection of two variables. Usually,
an independent variable (e.g., gender) appears along the vertical axis and a dependent one
appears along the horizontal axis (e.g., activities). You read “across” the table to see how the
independent and dependent variables relate to each other.
Group      0–4   5–8   9–12   13–16   17+
Children   32    68    37     23      22
Adults     36    48    43     83      25
Interpreting a contingency table is easier when the raw data is converted to percentages.
Percentages make each row comparable to the other by making it seem as if each group had
only 100 observations or participants. When creating a percentage-based contingency table,
you add the N for each independent variable on the end.
From this table, it is clearer that similar proportions of children and adults go to the
library over 17 times a year. Additionally, children most commonly went to the library
between 5 and 8 times, while for adults, this number was between 13 and 16.
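Converting the raw counts to row percentages can be sketched in Python as follows. The counts are the ones given for children and adults above; the function name row_percentages is illustrative, and each list is assumed to run from the lowest to the highest visit category.

```python
# Raw counts per visit category, lowest to highest (assumed order).
counts = {
    "Children": [32, 68, 37, 23, 22],
    "Adults": [36, 48, 43, 83, 25],
}

def row_percentages(row):
    # Express each cell as a percentage of its row total (the N for that group).
    total = sum(row)
    return [round(100 * c / total, 1) for c in row], total

for group, row in counts.items():
    percentages, n = row_percentages(row)
    print(group, percentages, "N =", n)
```

Each row of percentages sums to roughly 100, which is what makes groups of different sizes comparable.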
Scatter plots
A scatter plot is a chart that shows you the relationship between two or three variables. It’s a
visual representation of the strength of a relationship.
In a scatter plot, you plot one variable along the x-axis and another one along the y-axis. Each
data point is represented by a point in the chart.
Scatter plot example: Library visits and movie theater visits
You investigate whether people
who visit the library more tend to watch a movie at a theater less. You plot the number of
times participants watched movies at a theater along the x-axis and visits to the library along
the y-axis.
From your scatter plot, you see that as the number of movies seen at movie theaters increases,
the number of visits to the library decreases. Based on your visual assessment of a possible
linear relationship, you perform further tests of correlation and regression.
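One such follow-up test is the Pearson correlation coefficient. The sketch below computes it by hand in Python; the paired values are hypothetical and the function name pearson_r is illustrative.

```python
import math

def pearson_r(xs, ys):
    # Pearson correlation: covariance scaled by the spread of each variable.
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical paired responses: movie theater visits and library visits.
movies = [0, 2, 4, 6, 8, 10]
library = [24, 15, 12, 3, 3, 0]

r = pearson_r(movies, library)
print(round(r, 2))
```

A value close to -1 indicates a strong negative linear relationship, matching the downward pattern described for the scatter plot.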
If your data is in a single row or column, type a colon followed by the letter and
number corresponding to the last data point and then close the parentheses to return
the minimum value. You can also do this by clicking the appropriate cell after
opening the parentheses and then holding down "Shift" and clicking the cell with
the last data point before closing the parentheses.