Desc Excel

Descriptive statistics summarize and organize characteristics of data. There are three main types: distribution concerns frequency of values, central tendency concerns averages, and variability concerns how spread out values are. Measures of central tendency like mean, median, and mode estimate the center or average of a dataset.

Descriptive Statistics | Definitions, Types, Examples

Published on July 9, 2020 by Pritha Bhandari. Revised on June 21, 2023.

Descriptive statistics summarize and organize characteristics of a data set. A data set is a
collection of responses or observations from a sample or entire population.

In quantitative research, after collecting data, the first step of statistical analysis is to describe
characteristics of the responses, such as the average of one variable (e.g., age), or the relation
between two variables (e.g., age and creativity).

The next step is inferential statistics, which help you decide whether your data confirms or
refutes your hypothesis and whether it is generalizable to a larger population.

Types of descriptive statistics


There are 3 main types of descriptive statistics:

• The distribution concerns the frequency of each value.
• The central tendency concerns the averages of the values.
• The variability or dispersion concerns how spread out the values are.

You can apply these to assess only one variable at a time, in univariate analysis, or to
compare two or more, in bivariate and multivariate analysis.

Research example: You want to study the popularity of different leisure activities by gender.
You distribute a survey and ask participants how many times they did each of the following
in the past year:

• Go to a library
• Watch a movie at a theater
• Visit a national park

Your data set is the collection of responses to the survey. Now you can use descriptive
statistics to find out the overall frequency of each activity (distribution), the averages for each
activity (central tendency), and the spread of responses for each activity (variability).
Frequency distribution
A data set is made up of a distribution of values, or scores. In tables or graphs, you can
summarize the frequency of every possible value of a variable in numbers or percentages.
This is called a frequency distribution.

Simple frequency distribution table

For the variable of gender, you list all possible answers in the left-hand column. You count
the number or percentage of responses for each answer and display it in the right-hand
column.

Gender    Number
Male      182
Female    235
Other     27

From this table, you can see that more women than men or people with another gender
identity took part in the study.
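As a rough illustration, the frequency table above can be tallied in a few lines of Python. This is only a sketch: the `responses` list is rebuilt from the counts in the table rather than taken from real survey records, and the variable names are assumptions made for this example.

```python
from collections import Counter

# Individual answers, reconstructed from the counts in the table (illustrative only).
responses = ["Male"] * 182 + ["Female"] * 235 + ["Other"] * 27

# Counter tallies how often each answer occurs: the frequency distribution.
frequencies = Counter(responses)
total = sum(frequencies.values())

# Express each frequency as a percentage of all responses.
percentages = {
    answer: round(100 * count / total, 1)
    for answer, count in frequencies.items()
}

# Print the table, most frequent answer first.
for answer, count in frequencies.most_common():
    print(f"{answer:<8}{count:>5}{percentages[answer]:>7}%")
```

Reporting percentages alongside raw counts makes it easier to compare groups of different sizes.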

Measures of central tendency


Measures of central tendency estimate the center, or average, of a data set. The mean, median
and mode are 3 ways of finding the average.

Here we will demonstrate how to calculate the mean using the first 6 responses of our
survey.

Mean

The mean, or M, is the most commonly used method for finding the average.

To find the mean, simply add up all response values and divide the sum by the total number
of responses. The total number of responses or observations is called N.

Mean number of library visits

Data set                     15, 3, 12, 0, 24, 3
Sum of all values            15 + 3 + 12 + 0 + 24 + 3 = 57
Total number of responses    N = 6
Mean                         Divide the sum of values by N to find M: 57/6 = 9.5
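For comparison, here is a minimal sketch of the same calculation using Python's standard `statistics` module, with the median and mode of the same six responses computed alongside the mean. The variable names are chosen for this example only.

```python
import statistics

# First 6 survey responses: number of library visits in the past year.
library_visits = [15, 3, 12, 0, 24, 3]

mean_visits = statistics.mean(library_visits)      # sum of values divided by N
median_visits = statistics.median(library_visits)  # middle of the ordered values
mode_visits = statistics.mode(library_visits)      # most frequent value

print(mean_visits, median_visits, mode_visits)
```

The three measures differ for this small data set, which is a reminder that "the average" depends on which measure is used.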

Central Tendency | Understanding the Mean, Median & Mode


Published on July 30, 2020 by Pritha Bhandari. Revised on June 21, 2023.

Measures of central tendency help you find the middle, or the average, of a dataset. The 3
most common measures of central tendency are the mode, median, and mean.

• Mode: the most frequent value.
• Median: the middle number in an ordered dataset.
• Mean: the sum of all values divided by the total number of values.

In addition to central tendency, the variability and distribution of your dataset are important to
understand when performing descriptive statistics.

Distributions and central tendency


A dataset is a distribution of n scores or values.

Normal distribution
In a normal distribution, data is symmetrically distributed with no skew. Most values cluster
around a central region, with values tapering off as they go further away from the center. The
mean, mode and median are exactly the same in a normal distribution.

Example: Normal distribution
You survey a sample in your local community on the number of books they read in the last year.
A histogram of your data shows the frequency of responses for each possible number of
books. From looking at the chart, you see that there is a normal distribution.
The mean, median and mode are all equal; the central tendency of this dataset is 8.

Skewed distributions
In skewed distributions, more values fall on one side of the center than the other, and the
mean, median and mode all differ from each other. One side has a more spread out and longer
tail with fewer scores at one end than the other. The direction of this tail tells you the side of
the skew.

In a positively skewed distribution, there’s a cluster of lower scores and a spread out tail on
the right. In a negatively skewed distribution, there’s a cluster of higher scores and a spread
out tail on the left.

Positively skewed distribution

In a histogram of a positively skewed distribution, the data is skewed to the right, and the
central tendency of the dataset is on the lower end of possible scores.

In a positively skewed distribution, mode < median < mean.
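The ordering of the three measures can be checked on a small example. The sketch below uses a made-up, right-skewed data set (hypothetical values, not taken from the survey): most scores are low, and a few large values stretch the tail to the right.

```python
import statistics

# Hypothetical right-skewed scores: a cluster of low values and a long right tail.
skewed_scores = [1, 2, 2, 2, 3, 3, 4, 5, 9, 14]

skew_mode = statistics.mode(skewed_scores)
skew_median = statistics.median(skewed_scores)
skew_mean = statistics.mean(skewed_scores)

# For this data set the measures line up as mode < median < mean.
print(skew_mode, skew_median, skew_mean)
```

The large values in the tail pull the mean upward, while the median and mode stay close to the cluster of low scores.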


Mean
The mean (aka the arithmetic mean, different from the geometric mean) of a dataset is the
sum of all values divided by the total number of values. It’s the most commonly used
measure of central tendency and is often referred to as the “average.”

Mean formulas for populations and samples


In research, you often collect data from samples and perform inferential statistics to
understand the population they came from.

The formulas for the sample mean and the population mean only differ in mathematical
notation. Population attributes use capital letters while sample attributes use lowercase letters.
Population mean

$\bar{X} = \dfrac{\sum X}{N}$

• $\bar{X}$ = population mean
• $\sum X$ = sum of each value in the population
• $N$ = number of values in the population

The population mean can also be denoted as μ.

Sample mean

$\bar{x} = \dfrac{\sum x}{n}$

• $\bar{x}$ = sample mean
• $\sum x$ = sum of each value in the sample
• $n$ = number of values in the sample

The sample mean is also referred to as M.

Steps for calculating the mean by hand


There are two steps for calculating the mean:

1. Add up all the values in the data set.


2. Divide this number by the number of values.

We’ll walk through these steps with a sample data set.

Let’s say you want to find the average amount people spend on a restaurant meal in your
neighborhood. You ask a sample of 8 neighbors how much they spent the last time they went
out for dinner, and find the mean cost.

Data set

Cost of dinner for two (USD) 42 13 31 87 24 58 76 69

Step 1: Find the sum of the values by adding them all up


Because we’re working with a sample, we use the sample formula.
Formula: $\sum x$
Calculation: 42 + 13 + 31 + 87 + 24 + 58 + 76 + 69 = 400

Step 2: Divide the sum by the number of values


In the formula, n is the number of values in your data set. Our data set has 8 values.

Formula: $\bar{x} = \dfrac{\sum x}{n}$
Calculation: $n = 8$, $\sum x = 400$, so $\bar{x} = \dfrac{400}{8} = 50$

The mean tells us that in our sample, participants spent an average of 50 USD on their
restaurant bill.
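The two steps translate directly into code. A minimal sketch, with variable names chosen for this example:

```python
# Cost of dinner for two (USD), as reported by the 8 neighbors.
dinner_costs = [42, 13, 31, 87, 24, 58, 76, 69]

# Step 1: add up all the values in the data set.
total_cost = sum(dinner_costs)

# Step 2: divide the sum by the number of values (n).
n = len(dinner_costs)
sample_mean = total_cost / n

print(total_cost, n, sample_mean)
```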

Outlier effect on the mean


Outliers are extreme values that differ from most values in the data set. Because all values
are used in the calculation of the mean, an outlier can have a dramatic effect on the mean by
pulling the mean away from the majority of the values.

Let’s see what happens to the mean when we add an outlier to our data set.

Data set
Cost of dinner for two (USD) 42 13 31 87 24 58 76 69 230

Step 1: Find the sum of the values by adding them all up


Formula: $\sum x$
Calculation: 42 + 13 + 31 + 87 + 24 + 58 + 76 + 69 + 230 = 630

Step 2: Divide the sum by the number of values

Formula: $\bar{x} = \dfrac{\sum x}{n}$
Calculation: $n = 9$, $\sum x = 630$, so $\bar{x} = \dfrac{630}{9} = 70$

As we can see, adding just one outlier to our data set raised the mean by 20 USD. In this case,
a different measure of central tendency, like the median, would be more appropriate.
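To see why the median is the more robust choice here, the sketch below computes both measures with and without the 230 USD outlier, using Python's `statistics` module. The variable names are illustrative.

```python
import statistics

dinner_costs = [42, 13, 31, 87, 24, 58, 76, 69]
with_outlier = dinner_costs + [230]  # the same data plus one extreme value

mean_before = statistics.mean(dinner_costs)
mean_after = statistics.mean(with_outlier)

median_before = statistics.median(dinner_costs)
median_after = statistics.median(with_outlier)

# How far each measure moves when the outlier is added.
mean_shift = mean_after - mean_before
median_shift = median_after - median_before

print(mean_shift, median_shift)
```

For this data set the outlier moves the mean by more than twice as much as it moves the median.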

When can you use the mean, median or mode?


The mean is the most widely used measure of central tendency because it uses all values in its
calculation. The best measure of central tendency depends on your type of variable and the
shape of your distribution.

Type of variable
The mean can only be calculated for quantitative variables (e.g., height), and it can’t be found
for categorical variables (e.g., gender).

In categorical variables, data is placed into groupings without exact numerical values, so the
mean cannot be calculated. For categorical variables, the mode is the best measure of central
tendency because it tells you the most common characteristic or popular choice for your
sample.

But for continuous or discrete variables, you have exact numerical values. With these, you
can easily calculate the mean or median.

Distribution shape
The mean is best for data sets with normal distributions. In a normal distribution, data is
symmetrically distributed with no skew. Most values cluster around a central region, with
values tapering off as they go further away from the center.

The mean, mode and median are exactly the same in a normal distribution.
In skewed distributions, more values fall on one side of the center than the other, and the
mean, median and mode all differ from each other. One side has a more spread out and longer
tail with fewer scores at one end than the other.
For skewed distributions and distributions with outliers, the mean is easily influenced by
extreme values and may not accurately represent the central tendency. The median is a better
measure for these distributions as it takes a value from the middle of the distribution.

Alternatively, you can systematically review and remove outliers from your dataset in
the data cleansing process.

Mode
The mode is the most frequently occurring value in the dataset. It’s possible to have no mode,
one mode, or more than one mode.

To find the mode, sort your dataset numerically or categorically and select the response that
occurs most frequently.

Example: Finding the mode
In a survey, you ask 9 participants whether they identify as conservative, moderate, or liberal.
To find the mode, sort your data by category and find which response was chosen most
frequently.

To make it easier, you can create a frequency table to count up the values for each category.

Political ideology Frequency

Conservative 2

Moderate 3

Liberal 4

Mode: Liberal
The mode is easily seen in a bar graph because it is the value with the highest bar.
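A frequency table like the one above can be built with `collections.Counter`. In this sketch the `ideology_responses` list is reconstructed from the table's counts, so the order of individual answers is an assumption.

```python
from collections import Counter

# The 9 responses, rebuilt from the frequency table (order is illustrative).
ideology_responses = (
    ["Conservative"] * 2 + ["Moderate"] * 3 + ["Liberal"] * 4
)

# Count the responses per category, then take the most frequent one.
ideology_counts = Counter(ideology_responses)
ideology_mode, ideology_mode_count = ideology_counts.most_common(1)[0]

print(ideology_mode, ideology_mode_count)
```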
When to use the mode
The mode is most applicable to data from a nominal level of measurement. Nominal data is
classified into mutually exclusive categories, so the mode tells you the most popular
category.

For continuous variables or ratio levels of measurement, the mode may not be a helpful
measure of central tendency. That’s because there are many more possible values than there
are in a nominal or ordinal level of measurement. It’s unlikely for a value to repeat in a
ratio level of measurement.

Example: Ratio data with no mode
You collect data on reaction times in a computer task, and your dataset contains values that
are all different from each other.
Participant 1 2 3 4 5 6 7 8 9

Reaction time (milliseconds) 267 345 421 324 401 312 382 298 303

In this dataset, there is no mode, because each value occurs only once.
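The convention used in this example (no mode when no value repeats) can be sketched as a small helper. The function name `find_modes` is hypothetical. Note that Python's built-in `statistics.multimode` follows a different convention and returns every value when all of them occur once.

```python
from collections import Counter

def find_modes(values):
    """Return the most frequent value(s), or an empty list if nothing repeats."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:
        # Every value occurs exactly once, so the dataset has no mode.
        return []
    return [value for value, count in counts.items() if count == highest]

reaction_times = [267, 345, 421, 324, 401, 312, 382, 298, 303]
print(find_modes(reaction_times))
```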

Median
The median of a dataset is the value that’s exactly in the middle when it is ordered from low
to high.

Example: Finding the median
You measure the reaction times of 7 participants on a computer task and categorize them
into 3 groups: slow, medium or fast.
Participant 1 2 3 4 5 6 7

Speed Medium Slow Fast Fast Medium Fast Slow

To find the median, you first order all values from low to high. Then, you find the value in
the middle of the ordered dataset—in this case, the value in the 4th position.

Ordered dataset Slow Slow Medium Medium Fast Fast Fast

Median: Medium
In larger datasets, it’s easier to use simple formulas to figure out the position of the middle
value in the distribution. You use different methods to find the median of a dataset depending
on whether the total number of values is even or odd.

Median of an odd-numbered dataset

For an odd-numbered dataset, find the value that lies at the $\frac{n+1}{2}$ position, where n is the
number of values in the dataset.

Example
You measure the reaction times in milliseconds of 5 participants and order the dataset.

Reaction time (milliseconds) 287 298 345 365 380

The middle position is calculated using $\frac{n+1}{2}$, where n = 5:

$\dfrac{5+1}{2} = 3$

That means the median is the 3rd value in your ordered dataset.

Median: 345 milliseconds

Median of an even-numbered dataset


For an even-numbered dataset, find the two values in the middle of the dataset: the values at
the $\frac{n}{2}$ and $\frac{n}{2} + 1$ positions. Then, find their mean.

Example
You measure the reaction times of 6 participants and order the dataset.

Reaction time (milliseconds) 287 298 345 357 365 380

The middle positions are calculated using $\frac{n}{2}$ and $\frac{n}{2} + 1$, where n = 6:

$\dfrac{6}{2} = 3$ and $\dfrac{6}{2} + 1 = 4$

That means the middle values are the 3rd value, which is 345, and the 4th value, which
is 357.

To get the median, take the mean of the 2 middle values by adding them together and
dividing by 2:

$\dfrac{345 + 357}{2} = 351$

Median: 351 milliseconds
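The odd and even procedures can be combined into one function. This is a sketch that follows the position formulas (n + 1)/2, n/2 and n/2 + 1 described in this section. The name `median_by_position` is made up for this example, and positions are converted from 1-based (as in the text) to Python's 0-based indexing.

```python
def median_by_position(values):
    """Find the median using the middle-position formulas (1-based positions)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty dataset is undefined")
    if n % 2 == 1:
        # Odd n: a single middle value at position (n + 1) / 2.
        position = (n + 1) // 2
        return ordered[position - 1]
    # Even n: two middle values at positions n / 2 and n / 2 + 1.
    lower_position = n // 2
    upper_position = n // 2 + 1
    return (ordered[lower_position - 1] + ordered[upper_position - 1]) / 2

odd_times = [287, 298, 345, 365, 380]
even_times = [287, 298, 345, 357, 365, 380]
print(median_by_position(odd_times), median_by_position(even_times))
```

Sorting inside the function means the input does not need to be ordered beforehand.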

Mean
The arithmetic mean of a dataset (which is different from the geometric mean) is the sum of
all values divided by the total number of values. It’s the most commonly used measure of
central tendency because all values are used in the calculation.

Example: Finding the mean


Participant 1 2 3 4 5

Reaction time (milliseconds) 287 345 365 298 380

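First you add up the sum of all values:

$\sum x = 287 + 345 + 365 + 298 + 380 = 1675$

Then you calculate the mean using the formula $\bar{x} = \dfrac{\sum x}{n}$.

There are 5 values in the dataset, so n = 5.

$\bar{x} = \dfrac{1675}{5} = 335$

Mean (x̄): 335 milliseconds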

Outlier effect on the mean


Outliers can significantly increase or decrease the mean when they are included in the
calculation. Since all values are used to calculate the mean, it can be affected by extreme
outliers. An outlier is a value that differs significantly from the others in a dataset.

Example: Mean with an outlier
In this dataset, we swap out one value with an extreme outlier.

Participant 1 2 3 4 5

Reaction time (milliseconds) 832 345 365 298 380

Due to the outlier, the mean (x̄) becomes much higher, even though all the other numbers in
the dataset stay the same.

$\bar{x} = \dfrac{832 + 345 + 365 + 298 + 380}{5} = \dfrac{2220}{5} = 444$

Mean: 444 milliseconds

Population versus sample mean


A dataset contains values from a sample or a population. A population is the entire group that
you are interested in researching, while a sample is only a subset of that population.

While data from a sample can help you make estimates about a population, only full
population data can give you the complete picture.

In statistics, the notation of a sample mean and a population mean and their formulas are
different. But the procedures for calculating the population and sample means are the same.

Sample mean formula
The sample mean is written as M or x̄ (pronounced x-bar). For calculating the mean of a
sample, use this formula:

$\bar{x} = \dfrac{\sum x}{n}$

• x̄: sample mean
• Σx: sum of all values in the sample dataset
• n: number of values in the sample dataset

Population mean formula
The population mean is written as μ (the Greek letter mu). For calculating the mean of a
population, use this formula:

$\mu = \dfrac{\sum X}{N}$

• μ: population mean
• ΣX: sum of all values in the population dataset
• N: number of values in the population dataset

When should you use the mean, median or mode?


The 3 main measures of central tendency are best used in combination with each other
because they have complementary strengths and limitations. But sometimes only 1 or 2 of
them are applicable to your dataset, depending on the level of measurement of the variable.

• The mode can be used for any level of measurement, but it’s most meaningful for
  nominal and ordinal levels.
• The median can only be used on data that can be ordered – that is, from ordinal,
  interval and ratio levels of measurement.
• The mean can only be used on interval and ratio levels of measurement because it
  requires equal spacing between adjacent values or scores in the scale.

Levels of measurement    Examples                                  Measure of central tendency
Nominal                  Ethnicity, political ideology             Mode
Ordinal                  Level of anxiety, income bracket          Mode, median
Interval and ratio       Reaction time, test score, temperature    Mode, median, mean

To decide which measures of central tendency to use, you should also consider the
distribution of your dataset.

For normally distributed data, all three measures of central tendency will give you the same
answer so they can all be used.

In skewed distributions, the median is the best measure because it is unaffected by
extreme outliers or non-symmetric distributions of scores. The mean and mode can vary in
skewed distributions.
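The level-of-measurement rules above can be written down as a simple lookup. This is only an illustrative sketch. The names `APPLICABLE_MEASURES` and `applicable_measures` are invented for this example, and the lookup does not account for the shape of the distribution.

```python
# Which measures of central tendency apply at each level of measurement.
APPLICABLE_MEASURES = {
    "nominal": ["mode"],
    "ordinal": ["mode", "median"],
    "interval": ["mode", "median", "mean"],
    "ratio": ["mode", "median", "mean"],
}

def applicable_measures(level):
    """Return the measures that can be used for a level of measurement."""
    key = level.strip().lower()
    if key not in APPLICABLE_MEASURES:
        raise ValueError(f"unknown level of measurement: {level!r}")
    return APPLICABLE_MEASURES[key]

print(applicable_measures("Ordinal"))
```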

Mode
The mode or modal value of a data set is the most frequently occurring value. It’s a measure
of central tendency that tells you the most popular choice or most common characteristic of
your sample.

When reporting descriptive statistics, measures of central tendency help you find the middle
or the average of your data set. The three most common measures of central tendency are the
mode, median, and mean.

How many modes can you have?


A data set can have no mode, one mode or more than one mode – it all depends on how
many different values repeat most frequently.

Your data can be:

• without any mode,
• unimodal, with one mode,
• bimodal, with two modes,
• trimodal, with three modes, or
• multimodal, with four or more modes.

Find the mode (by hand)


To find the mode, follow these two steps:

1. If the data for your variable takes the form of numerical values, order the values from
low to high. If it takes the form of categories or groupings, sort the values by group, in
any order.
2. Identify the value or values that occur most frequently.

Numerical mode example


Your data set is the ages of 6 college students.

Data set
Participant A B C D E F

Age 19 22 20 21 22 23

By ordering the values from low to high, we can easily see the value that occurs most
frequently.

Ordered data set


Age 19 20 21 22 22 23

The mode of this data set is 22.

Categorical mode example


Your data set contains the highest education levels of the participants’ parents.

Data set

Participant    Parents’ education level
A              Bachelor’s degree
B              Master’s degree
C              High school diploma
D              Bachelor’s degree
E              Doctoral degree
F              Master’s degree

To sort the values by group, you create a simple frequency table. Place the categories on the
left hand side and the frequencies on the right hand side.

Frequency table
Parents’ education level Frequency

Bachelor’s degree 2

Master’s degree 2
High school diploma 1

Doctoral degree 1

From the table, you can see that there are two modes. This means you have a bimodal data
set.

The modes are Bachelor’s degree and Master’s degree.
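Both worked examples can be reproduced with `statistics.multimode` from the Python standard library (available from Python 3.8), which returns every value tied for the highest frequency. The variable names are chosen for this sketch.

```python
import statistics

# Numerical example: ages of 6 college students.
ages = [19, 22, 20, 21, 22, 23]

# Categorical example: parents' highest education level, participants A to F.
education_levels = [
    "Bachelor's degree",
    "Master's degree",
    "High school diploma",
    "Bachelor's degree",
    "Doctoral degree",
    "Master's degree",
]

age_modes = statistics.multimode(ages)
education_modes = statistics.multimode(education_levels)

# One mode means unimodal; two modes means bimodal.
print(age_modes, education_modes)
```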

Find the mode with grouped data

A grouped frequency table organizes large numerical data sets into intervals or classes of
values and reports the frequency of values in each class.

For grouped data, you can report the mode in two ways:

• The modal class is the grouping with the highest frequency of values.
• The modal value is estimated as the midpoint of the modal class.

The mode is only an estimate in this case, because the actual values within the modal class
are unknown.

Modal class and modal value example


You have a data set that includes the average reaction times of participants. You organize the
data into a frequency table.

Reaction times are placed in classes of 100 milliseconds each. The frequency column shows
the number of participants within each class.

Grouped frequency table


Reaction time (milliseconds) Frequency

200–299 6

300–399 13

400–499 17

500–599 25

600–699 21
700–799 12

800–899 4

You can visualize your data set by plotting your data on a histogram. The mode is the value
with the highest peak on a histogram or bar chart.

From your table or histogram, you can see that the modal class – the group in which values
appear most frequently – is 500–599 milliseconds. Therefore, the mode is estimated to be at
the midpoint of this class: 550 milliseconds.
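A minimal sketch of the same estimate in code. It assumes each class is identified by its lower bound and covers 100 milliseconds, so the midpoint is the lower bound plus half the class width. The dictionary layout and variable names are assumptions made for this example.

```python
CLASS_WIDTH = 100  # each class spans 100 milliseconds

# Lower bound of each class mapped to its frequency (from the grouped table).
grouped_frequencies = {
    200: 6,
    300: 13,
    400: 17,
    500: 25,
    600: 21,
    700: 12,
    800: 4,
}

# The modal class is the class with the highest frequency.
modal_lower_bound = max(grouped_frequencies, key=grouped_frequencies.get)
modal_class = (modal_lower_bound, modal_lower_bound + CLASS_WIDTH - 1)

# The modal value is estimated as the midpoint of the modal class.
modal_value = modal_lower_bound + CLASS_WIDTH / 2

print(modal_class, modal_value)
```

Re-running the same logic with a different class width would generally pick a different modal class, which is why the choice of intervals matters.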

Importantly, the choice of intervals in grouped data can have a large impact on the mode. For
example, changing the intervals from 100 ms long to 50 or 200 ms long could result in
completely different modes.

When to use the mode


The level of measurement of your variables determines when you should use the mode.

The mode works best with categorical data. It is the only measure of central tendency
for nominal variables, where it can reflect the most commonly found characteristic (e.g.,
demographic information). The mode is also useful with ordinal variables – for example, to
reflect the most popular answer on a ranked scale (e.g., level of agreement).

For quantitative data, such as reaction time or height, the mode may not be a helpful measure
of central tendency. That’s because there are often many more possible values for
quantitative data than there are for categorical data, so it’s unlikely for values to repeat.

Example of quantitative data with no mode


You collect data on reaction times in a computer task, and your data set contains values that
are all different from each other.

Data set with no mode


Reaction time (milliseconds) 267 345 421 324 401 312 382 298 303

In this data set, there is no mode, because each value occurs only once.

Median
The median is the value that’s exactly in the middle of a dataset when it is ordered. It’s a
measure of central tendency that separates the lowest 50% from the highest 50% of values.

The steps for finding the median differ depending on whether you have an odd or an even
number of data points. If there are two numbers in the middle of a dataset, their mean is the
median.

The median is usually used with quantitative data (where the values are numerical), but you
can sometimes also find the median for an ordinal dataset (where the values are ranked
categories).

Find the median with an odd-numbered dataset


We’ll walk through steps using a small sample dataset with the weekly pay of 5 people.

Dataset
Weekly pay (USD) 350 800 220 500 130

Step 1: Order the values from low to high.


Ordered dataset
Weekly pay (USD) 130 220 350 500 800

Step 2: Calculate the middle position.

Use the formula $\frac{n+1}{2}$, where n is the number of values in your dataset.

Calculating the middle position

Formula: $\dfrac{n+1}{2}$
Calculation: $\dfrac{5+1}{2} = 3$

The median is the value at the 3rd position.

Step 3: Find the value in the middle position.


Finding the median
Weekly pay (USD) 130 220 350 500 800

The median weekly pay is 350 US dollars.

Find the median with an even-numbered dataset


In an even-numbered dataset, there isn’t a single value in the middle of the dataset, so we
have to follow a slightly different procedure.

Let’s add another value to the dataset. Now you have 6 values.

Dataset
Weekly pay (USD) 350 800 220 500 130 1150

Step 1: Order the values from low to high.


Ordered dataset
Weekly pay (USD) 130 220 350 500 800 1150

Step 2: Calculate the two middle positions.

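The middle positions are found using the formulas $\frac{n}{2}$ and $\frac{n}{2} + 1$, where n is the number of
values in your dataset.

Calculating the middle positions

Formula: $\dfrac{n}{2}$              Calculation: $\dfrac{6}{2} = 3$
Formula: $\dfrac{n}{2} + 1$          Calculation: $\dfrac{6}{2} + 1 = 4$

The middle values are at the 3rd and 4th positions.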

Step 3: Find the two middle values.


Middle values
Weekly pay (USD) 130 220 350 500 800 1150

The middle values are 350 and 500.

Step 4: Find the mean of the two middle values.


To find the median, calculate the mean by adding together the middle values and dividing
the sum by two.

Calculating the median

Median: $\dfrac{350 + 500}{2} = 425$

The median weekly pay for this dataset is 425 US dollars.

Find the median with ordinal data


The median is usually used for quantitative data, which means the values in the dataset are
numerical. But you can sometimes also identify the median for ordinal data.

Ordinal data is organized into categories with a rank order – for example language ability
level (beginner, intermediate, or fluent) or level of agreement (strongly agree, agree, etc.).

The process for finding the median is almost the same.

Odd-numbered dataset
We’ll walk through the steps for an odd-numbered ordinal dataset with 7 values.

You categorize reaction times of participants into 3 groups: slow, medium or fast.
First, order all values in ascending order.

Ordered dataset
Reaction speed Slow Slow Medium Medium Fast Fast Fast

Next, find the middle position using $\frac{n+1}{2}$, where n is the number of values in the dataset.

Calculating the middle position

Formula: $\dfrac{n+1}{2}$
Calculation: $\dfrac{7+1}{2} = 4$

The median is the value at the 4th position.

Finding the median


Reaction speed Slow Slow Medium Medium Fast Fast Fast

The median reaction speed is Medium.

Can you find the median for an even-numbered ordinal dataset?


The mean cannot be calculated for ordinal data, so the median can’t be found for an
even-numbered dataset when the two middle values differ.

For example, if the two middle values are “slow” and “medium,” you can’t calculate the
mean of these values.

In practice, ordinal data is sometimes converted into a numerical format and treated like
quantitative data for the sake of convenience. Then the mean of the middle values can be
calculated to find the median.

While this is considered acceptable in some contexts, it is not always seen as correct.
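One way to handle ordinal data without converting it to numbers is sketched below. The function `ordinal_median` and the `SPEED_ORDER` list are hypothetical names for this example. The function returns the middle category when it is well defined, and `None` when an even-numbered dataset has two different middle categories.

```python
SPEED_ORDER = ["Slow", "Medium", "Fast"]  # categories from lowest to highest rank

def ordinal_median(values, order):
    """Return the median category, or None if the two middle categories differ."""
    rank = {category: position for position, category in enumerate(order)}
    ordered = sorted(values, key=lambda category: rank[category])
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty dataset is undefined")
    if n % 2 == 1:
        # Odd n: the value at position (n + 1) / 2 (1-based).
        return ordered[(n + 1) // 2 - 1]
    # Even n: the values at positions n / 2 and n / 2 + 1 (1-based).
    lower, upper = ordered[n // 2 - 1], ordered[n // 2]
    return lower if lower == upper else None

speeds = ["Medium", "Slow", "Fast", "Fast", "Medium", "Fast", "Slow"]
print(ordinal_median(speeds, SPEED_ORDER))
```

Returning `None` rather than averaging rank numbers keeps the result honest about what ordinal data can support.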

When should you use the median?


The median is the most informative measure of central tendency for skewed distributions or
distributions with outliers.

In skewed distributions, more values fall on one side of the center than the other, and the
mean, median and mode all differ from each other.

In a positively skewed distribution, there’s a cluster of lower scores and a spread out tail on
the right.
In a negatively skewed distribution, there’s a cluster of higher scores and a spread out tail on
the left.

Because the median only uses one or two values from the middle of a dataset, it’s unaffected
by extreme outliers or non-symmetric distributions of scores. In contrast, the positions of the
mean and mode can vary in skewed distributions.

For this reason, the median is often reported as a measure of central tendency for variables
such as income, because these distributions are usually positively skewed.

The level of measurement of your variable also determines whether you can use the median.
The median can only be used on data that can be ordered – that is,
from ordinal, interval and ratio levels of measurement.

Geometric Mean
The geometric mean is an average that multiplies all values and finds a root of the
product. For a dataset with n numbers, you find the nth root of their product. You can
use this descriptive statistic to summarize your data.

The geometric mean is an alternative to the arithmetic mean, which is often referred
to simply as “the mean.” While the arithmetic mean is based on adding values, the
geometric mean multiplies values.

The geometric mean formula can be written in two ways, but they are equivalent
mathematically.

$\sqrt[n]{\prod_{i=1}^{n} x_i} = \left(\prod_{i=1}^{n} x_i\right)^{\frac{1}{n}}$

• $\prod$ = product of …
• $x_i$ = every value
• $n$ = total number of values
• $\frac{1}{n}$ = reciprocal of $n$

The symbol pi (Π) is similar to the summation sign sigma (Σ), but instead it tells you
to find the product of what follows after it by multiplying them all together.

In the first formula, the geometric mean is the nth root of the product of all values.

In the second formula, the geometric mean is the product of all values raised to the
power of the reciprocal of n.

These formulas are equivalent because of the laws of exponents: taking the nth root
of x is exactly the same as raising x to the power of 1/n.

Calculating the geometric mean


There are two main steps to calculating the geometric mean:

1. Multiply all values together to get their product.


2. Find the nth root of the product (n is the number of values).

Before calculating this measure of central tendency, note that:

• The geometric mean can only be found for positive values.
• If any value in the dataset is zero, the geometric mean is zero.

When should you use the geometric mean?


The geometric mean is best for reporting average inflation, percentage change, and
growth rates. Because these types of data are expressed as fractions, the geometric
mean is more accurate for them than the arithmetic mean.
While the arithmetic mean is appropriate for values that are independent from each
other (e.g., test scores), the geometric mean is more appropriate for dependent
values, percentages, fractions, or widely ranging data.

We’ll walk you through some examples showing how to find the geometric means of
different types of data.

Example: Geometric mean of percentages


You’re interested in the average voter turnout of the past five US elections. You’ve
gathered the following data.

Year 2000 2004 2008 2012 2016

Voter turnout (%) 50.3 55.7 57.1 54.9 60.1

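Step 1: Multiply all values together to get their product.

Formula: $\prod_{i=1}^{n} x_i$
Calculation: 50.3 × 55.7 × 57.1 × 54.9 × 60.1 ≈ 527,844,626.70

Step 2: Find the nth root of the product (n is the number of values).

Formula: $\sqrt[n]{\prod_{i=1}^{n} x_i}$
Calculation: $\sqrt[5]{527{,}844{,}626.70} \approx 55.53$

The geometric mean voter turnout of the past five US elections was about 55.53%.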
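The two steps can be sketched as a small function that multiplies the values with `math.prod` and then raises the product to the power 1/n. The function name `geometric_mean_by_hand` is an assumption for this example. Negative values are rejected, in line with the note that the geometric mean is only found for positive values.

```python
import math

def geometric_mean_by_hand(values):
    """Multiply all values, then take the nth root of the product."""
    if len(values) == 0:
        raise ValueError("the geometric mean of an empty dataset is undefined")
    if any(value < 0 for value in values):
        raise ValueError("the geometric mean requires non-negative values")
    product = math.prod(values)           # Step 1: product of all values
    return product ** (1 / len(values))   # Step 2: nth root of the product

voter_turnout = [50.3, 55.7, 57.1, 54.9, 60.1]
turnout_geometric_mean = geometric_mean_by_hand(voter_turnout)
print(round(turnout_geometric_mean, 2))
```

For long lists of large values the product can overflow or lose precision. Averaging the logarithms of the values is a common alternative, not shown here.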

Example: Geometric mean of widely varying values


You compare the efficiency of two machines for three procedures that are assessed
on different scales. To find the mean efficiency of each machine, you find the
geometric and arithmetic means of their procedure rating scores.

Procedure 1 Procedure 2 Procedure 3

Machine A 7 80 2100

Machine B 3 94 2350
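Geometric mean of Machine A

Step 1: Multiply all values together to get their product.

Formula: $\prod_{i=1}^{n} x_i$
Calculation: 7 × 80 × 2100 = 1,176,000

Step 2: Find the nth root of the product (n is the number of values).

Formula: $\sqrt[n]{\prod_{i=1}^{n} x_i}$
Calculation: $\sqrt[3]{1{,}176{,}000} \approx 105.55$

Geometric mean of Machine B

Step 1: Multiply all values together to get their product.

Formula: $\prod_{i=1}^{n} x_i$
Calculation: 3 × 94 × 2350 = 662,700

Step 2: Find the nth root of the product (n is the number of values).

Formula: $\sqrt[n]{\prod_{i=1}^{n} x_i}$
Calculation: $\sqrt[3]{662{,}700} \approx 87.18$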

Comparing the means


Now you compare machine efficiency using arithmetic and geometric means.

Arithmetic mean Geometric mean

Machine A 729 105.55

Machine B 815.67 87.18

While the arithmetic means show higher efficiency for Machine B, the geometric
means show that Machine A is more efficient.

The geometric mean is more accurate here because the arithmetic mean is skewed
towards values that are higher than most of your dataset.
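
The two-step calculation for the machine ratings can be sketched in Python. This is a minimal illustration using only the standard library (Python 3.8 or later for `math.prod`); the function names `arithmetic_mean` and `geometric_mean` are illustrative choices, not something the article defines.

```python
import math

def arithmetic_mean(values):
    """Sum of the values divided by how many there are."""
    return sum(values) / len(values)

def geometric_mean(values):
    """nth root of the product of the values (values must be positive)."""
    product = math.prod(values)           # Step 1: multiply all values
    return product ** (1 / len(values))   # Step 2: take the nth root

machine_a = [7, 80, 2100]
machine_b = [3, 94, 2350]

for name, scores in (("Machine A", machine_a), ("Machine B", machine_b)):
    print(name,
          round(arithmetic_mean(scores), 2),
          round(geometric_mean(scores), 2))
```

The standard library also provides `statistics.geometric_mean`, which should give the same result up to floating-point rounding.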

Geometric mean vs. arithmetic mean


The geometric mean is more accurate than the arithmetic mean for showing
percentage change over time or compound interest.

For example, say you study fruit fly population growth rates. You’re interested in
understanding how environmental factors change these rates.

You begin with 2 fruit flies, and every 12 days you measure the percentage increase
in the population.

Each percentage change value is also converted into a growth factor that is in
decimals. The growth factor includes the original value (100%), so to convert
percentage increase into a growth factor, add 100 to each percentage increase and
divide by 100.

Day 12 24 36

Percentage increase 340 187 427

Growth factor 4.4 2.87 5.27

First, you convert percentage change into decimals. You add 100 to each value to
factor in the original amount, and divide each value by 100.

Arithmetic mean
To find the arithmetic mean, add up all values and divide this number by n.

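Formula: (x1 + x2 + … + xn) ÷ n
Calculation: (4.4 + 2.87 + 5.27) ÷ 3 = 4.18

Geometric mean
Step 1: Multiply all values together to get their product.

Formula: x1 × x2 × … × xn
Calculation: 4.4 × 2.87 × 5.27 ≈ 66.55

Step 2: Find the nth root of the product (n is the number of values).

Formula: ⁿ√(x1 × x2 × … × xn)
Calculation: ³√66.55 ≈ 4.05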

The arithmetic mean population growth factor is 4.18, while the geometric mean
growth factor is 4.05.

How do we know which mean is correct?


Because they are averages, multiplying the original number of flies by the mean
growth factor 3 times should give us the correct final population value for the
correct mean.

• Final population value: 2 × 4.4 × 2.87 × 5.27 ≈ 133 fruit flies
• Arithmetic mean of 418%: Final population = 2 × 4.18 × 4.18 × 4.18 ≈ 146 fruit flies
• Geometric mean of 405%: Final population = 2 × 4.05 × 4.05 × 4.05 ≈ 133 fruit flies

Only the geometric mean gives us the true number of fruit flies in the final population.
It’s the most accurate mean for the growth factor.
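
The check described here can be reproduced with a short script. This is a sketch under the assumption that the population starts at 2 flies and grows by the three growth factors from the table; the variable names are illustrative.

```python
import math

start_population = 2
growth_factors = [4.4, 2.87, 5.27]   # growth factors for days 12, 24 and 36
n = len(growth_factors)

arithmetic = sum(growth_factors) / n
geometric = math.prod(growth_factors) ** (1 / n)

# Actual final population: apply each growth factor in turn.
actual_final = start_population * math.prod(growth_factors)

# Projected final population if every period used the same mean factor.
projected_arithmetic = start_population * arithmetic ** n
projected_geometric = start_population * geometric ** n

print(round(arithmetic, 2), round(geometric, 2))
print(round(actual_final), round(projected_arithmetic), round(projected_geometric))
```

Only the projection based on the geometric mean matches the actual final population, because raising the geometric mean to the nth power recovers the original product of growth factors.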

When is the geometric mean better than the arithmetic mean?
Even though it’s less commonly used, the geometric mean is more accurate than the
arithmetic mean for positively skewed data and percentages.

In a positively skewed distribution, there’s a cluster of lower scores and a spread-out
tail on the right. Income distribution is a common example of a skewed dataset.

While most values tend to be low, the arithmetic mean is often pulled upward (or
rightward) by high values or outliers in a positively skewed dataset.
Because the geometric mean tends to be lower than the arithmetic mean, it
represents smaller values better than the arithmetic mean.

The geometric mean is most appropriate for ratio levels of measurement,
where variables have a true zero and don’t take on any negative values. Negative
percentage changes have to be framed positively: for instance, −8% becomes 92%
of the original value.

Measures of variability
Measures of variability give you a sense of how spread out the response values are. The
range, standard deviation and variance each reflect different aspects of spread.

Variability describes how far apart data points lie from each other and from the
center of a distribution. Along with measures of central tendency, measures of
variability give you descriptive statistics that summarize your data.

Variability is also referred to as spread, scatter or dispersion. It is most commonly
measured with the following:

• Range: the difference between the highest and lowest values
• Interquartile range: the range of the middle half of a distribution
• Standard deviation: average distance from the mean
• Variance: average of squared distances from the mean

While the central tendency, or average, tells you where most of your points lie,
variability summarizes how far apart they are. This is important because the amount
of variability determines how well you can generalize results from the sample to your
population.

Low variability is ideal because it means that you can better predict information
about the population based on sample data. High variability means that the values
are less consistent, so it’s harder to make predictions.
Data sets can have the same central tendency but different levels of variability
or vice versa. If you know only the central tendency or the variability, you can’t say
anything about the other aspect. Both of them together give you a complete picture
of your data.

Example: Variability in normal distributions
You are investigating the amounts of time spent on phones daily by different groups of people.
Using simple random samples, you collect data from 3 groups:

• Sample A: high school students,
• Sample B: college students,
• Sample C: adult full-time employees.

All three of your samples have the same average phone use, at 195 minutes or 3 hours and 15
minutes. This is the x-axis value where the curves peak.

Although the data follows a normal distribution, each sample has a different spread. Sample A
has the largest variability while Sample C has the smallest variability.

Range
The range tells you the spread of your data from the lowest to the highest value in
the distribution. It’s a commonly used measure of variability and the easiest one to
calculate.

The range is calculated by subtracting the lowest value from the highest value. While
a large range means high variability, a small range means low variability in a
distribution.

R = H – L

• R = range
• H = highest value
• L = lowest value

To find the range, follow these steps:

1. Order all values in your data set from low to high.
2. Subtract the lowest value from the highest value.

Range example
Your data set is the ages of 8 participants.


Participant 1 2 3 4 5 6 7 8

Age 37 19 31 29 21 26 33 36

First, order the values from low to high to identify the lowest value (L) and
the highest value (H).

Age 19 21 26 29 31 33 36 37

Then subtract the lowest from the highest value.

R=H–L

R = 37 – 19 = 18
The range of our data set is 18 years.

How useful is the range?


The range generally gives you a good indicator of variability when you have a
distribution without extreme values. When paired with measures of central tendency,
the range can tell you about the span of the distribution.

But the range can be misleading when you have outliers in your data set. One
extreme value in the data will give you a completely different range.

Range example with an outlier
One value in your data set is replaced with an outlier.
Age 19 21 26 29 31 33 36 61

Using the same calculation, we get a very different result this time:

R= H–L

R = 61 – 19 = 42

With an outlier, our range is now 42 years.


In the example above, the range indicates much more variability in the data than
there actually is. Although we have a large range, most values are actually clustered
around a clear middle.

Because only two numbers are used, the range is easily influenced by outliers. It
can’t tell you about the shape of the frequency distribution of values on its own.
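
The effect of a single outlier on the range is easy to see in code. This is a minimal sketch; the helper name `value_range` is an illustrative choice.

```python
def value_range(values):
    """Range R = H - L: highest value minus lowest value."""
    return max(values) - min(values)

ages = [37, 19, 31, 29, 21, 26, 33, 36]
ages_with_outlier = [19, 21, 26, 29, 31, 33, 36, 61]

print(value_range(ages))               # 18
print(value_range(ages_with_outlier))  # 42
```

Replacing one value more than doubles the range even though the other seven values are unchanged.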

Note: To get a clear idea of your data’s variability, the range is best used in combination with
other measures of variability like interquartile range and standard deviation.

Range example
You have 8 data points from Sample A.


Data (minutes) 72 110 134 190 238 287 305 324

The highest value (H) is 324 and the lowest (L) is 72.

R=H–L

R = 324 – 72 = 252

The range of your data is 252 minutes.


Because only 2 numbers are used, the range is influenced by outliers and doesn’t
give you any information about the distribution of values. It’s best used in
combination with other measures.

Interquartile range
The interquartile range gives you the spread of the middle half of your distribution.
For any distribution that’s ordered from low to high, the interquartile range contains
half of the values. While the first quartile (Q1) contains the first 25% of values, the
fourth quartile (Q4) contains the last 25% of values.

Quartiles segment any distribution that’s ordered from low to high into four equal
parts. The interquartile range (IQR) contains the second and third quartiles, or the
middle half of your data set. Whereas the range gives you the spread of the whole
data set, the interquartile range gives you the range of the middle half of a data set.
The interquartile range is the third quartile (Q3) minus the first quartile (Q1). This
gives us the range of the middle half of a data set.

Interquartile range example
To find the interquartile range of your 8 data points, you first find the values at Q1 and Q3.
Multiply the number of values in the data set (8) by 0.25 for the 25th percentile (Q1) and by
0.75 for the 75th percentile (Q3).

Q1 position: 0.25 x 8 = 2

Q3 position: 0.75 x 8 = 6

Q1 is the value in the 2nd position, which is 110. Q3 is the value in the 6th position, which
is 287.

IQR = Q3 – Q1

IQR = 287 – 110 = 177

The interquartile range of your data is 177 minutes.


Just like the range, the interquartile range uses only 2 values in its calculation. But
the IQR is less affected by outliers: the 2 values come from the middle half of the
data set, so they are unlikely to be extreme scores.

The IQR gives a consistent measure of variability for skewed as well as normal
distributions.
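
The position rule used in the example (multiply n by 0.25 and 0.75, then read off the values at those positions) can be sketched as follows. The helper `quartile_by_position` is a hypothetical name, and the sketch assumes the position works out to a whole number, as it does for 8 values; other quartile methods can give different results.

```python
def quartile_by_position(sorted_values, fraction):
    """Value at position fraction * n (1-based), as in the example.

    Assumes fraction * n is a whole number; other cases need a different rule.
    """
    position = fraction * len(sorted_values)
    if position != int(position):
        raise ValueError("position is not a whole number for this data set")
    return sorted_values[int(position) - 1]

minutes = sorted([72, 110, 134, 190, 238, 287, 305, 324])
q1 = quartile_by_position(minutes, 0.25)
q3 = quartile_by_position(minutes, 0.75)
iqr = q3 - q1
print(q1, q3, iqr)  # 110 287 177
```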
Calculate the interquartile range by hand
The interquartile range is found by subtracting the Q1 value from the Q3 value:

IQR = Q3 – Q1

• IQR = interquartile range
• Q3 = 3rd quartile or 75th percentile
• Q1 = 1st quartile or 25th percentile

Q1 is the value below which 25 percent of the distribution lies, while Q3 is the value
below which 75 percent of the distribution lies.

You can think of Q1 as the median of the first half and Q3 as the median of the
second half of the distribution.

Methods for finding the interquartile range


Although there’s only one formula, there are various different methods for identifying
the quartiles. You’ll get a different value for the interquartile range depending on the
method you use.

Here, we’ll discuss two of the most commonly used methods. These methods differ
based on how they use the median.

Exclusive method vs inclusive method


The exclusive method excludes the median when identifying Q1 and Q3, while
the inclusive method includes the median in identifying the quartiles.

The procedure for finding the median is different depending on whether your data set
is odd- or even-numbered.

 When you have an odd number of data points, the median is the value in the
middle of your data set. You can choose between the inclusive and exclusive
method.
 With an even number of data points, there are two values in the middle, so
the median is their mean. It’s more common to use the exclusive method in
this case.

While there is little consensus on the best method for finding the interquartile range,
the exclusive interquartile range is never smaller than the inclusive interquartile
range, and for odd-numbered data sets it is usually larger.

The exclusive interquartile range may be more appropriate for large samples, while
for small samples, the inclusive interquartile range may be more representative
because it’s a narrower range.

Steps for the exclusive method


To see how the exclusive method works by hand, we’ll use two examples: one with
an even number of data points, and one with an odd number.

Even-numbered data set


We’ll walk through four steps using a sample data set with 10 values.

Step 1: Order your values from low to high.

Step 2: Locate the median, and then separate the values below it from the values above it.

With an even-numbered data set, the median is the mean of the two values in the middle, so you simply
divide your data set into two halves.

Step 3: Find Q1 and Q3.

Q1 is the median of the first half and Q3 is the median of the second half. Since each of these halves has
an odd number of values, there is only one value in the middle of each half.

Step 4: Calculate the interquartile range.


Odd-numbered data set
This time we’ll use a data set with 11 values.

Step 1: Order your values from low to high.

Step 2: Locate the median, and then separate the values below it from the values above it.

In an odd-numbered data set, the median is the number in the middle of the list. The median itself is
excluded from both halves: one half contains all values below the median, and the other contains all the
values above it.

Step 3: Find Q1 and Q3.


Q1 is the median of the first half and Q3 is the median of the second half. Since each of these halves has
an odd number of values, there is only one value in the middle of each half.

Step 4: Calculate the interquartile range.


Steps for the inclusive method
Almost all of the steps for the inclusive and exclusive method are identical. The
difference is in how the data set is separated into two halves.

The inclusive method is sometimes preferred for odd-numbered data sets because it
doesn’t ignore the median, a real value in this type of data set.

Step 1: Order your values from low to high.

Step 2: Find the median, then separate the list into two halves, including the median in both halves.

The median is the number in the middle of the data set. It is included as the highest value in the
first half and the lowest value in the second half.

Step 3: Find Q1 and Q3.

Q1 is the median of the first half and Q3 is the median of the second half. Since the two halves each
contain an even number of values, Q1 and Q3 are calculated as the means of the middle values.

Step 4: Calculate the interquartile range.

We can see from these examples that using the inclusive method gives us a smaller
IQR. With the same data set, the exclusive IQR is 24, and the inclusive IQR is 20.
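
The two ways of splitting the data can be compared in code. This is a sketch of the median-of-halves procedure described in the steps above; the function `iqr_by_halves` and the 11-value data set are hypothetical and are not the data from the article's example.

```python
from statistics import median

def iqr_by_halves(values, include_median):
    """IQR from the medians of the lower and upper halves."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    if n % 2 == 0:
        # Even n: the median falls between two values, so the halves
        # are the same for both methods.
        lower, upper = data[:half], data[half:]
    elif include_median:
        # Inclusive: the median value belongs to both halves.
        lower, upper = data[:half + 1], data[half:]
    else:
        # Exclusive: the median value is left out of both halves.
        lower, upper = data[:half], data[half + 1:]
    return median(upper) - median(lower)

# Hypothetical data set with 11 values (not the article's data).
sample = [3, 5, 7, 8, 12, 13, 14, 18, 21, 23, 27]
print(iqr_by_halves(sample, include_median=False))  # 14
print(iqr_by_halves(sample, include_median=True))   # 12.0
```

Statistical packages often use interpolation-based quartile rules instead, so their results may differ from this by-hand procedure.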

When is the interquartile range useful?


The interquartile range is an especially useful measure of variability for skewed
distributions.

For these frequency distributions, the median is the best measure of central
tendency because it’s the value exactly in the middle when all values are ordered
from low to high.

Along with the median, the IQR can give you an overview of where most of your
values lie and how clustered they are.

The IQR is also useful for datasets with outliers. Because it’s based on the middle
half of the distribution, it’s less influenced by extreme values.

Visualize the interquartile range in boxplots


A boxplot, or a box-and-whisker plot, summarizes a data set visually using a five-
number summary.

Every distribution can be organized using these five numbers:

• Lowest value
• Q1: 25th percentile
• Median
• Q3: 75th percentile
• Highest value (Q4)

The vertical lines in the box show Q1, the median, and Q3, while the whiskers at the
ends show the highest and lowest values.

In a boxplot, the width of the box shows you the interquartile range. A smaller width
means you have less dispersion, while a larger width means you have more
dispersion.

An inclusive interquartile range will have a smaller width than an exclusive
interquartile range.

Boxplots are especially useful for showing the central tendency and dispersion of
skewed distributions.

The placement of the box tells you the direction of the skew. A box that’s much
closer to the right side means you have a negatively skewed distribution, and a box
closer to the left side tells you that you have a positively skewed distribution.

Five-number summary
Every distribution can be organized using a five-number summary:

• Lowest value
• Q1: 25th percentile
• Q2: the median
• Q3: 75th percentile
• Highest value (Q4)

These five-number summaries can be easily visualized using box and whisker plots.

Box and whisker plot example
For each of our samples, the horizontal lines in a box show Q1,
the median and Q3, while the whiskers at the end show the highest and lowest values.

Standard deviation
The standard deviation is the average amount of variability in your dataset.

It tells you, on average, how far each score lies from the mean. The larger the
standard deviation, the more variable the data set is.

There are six steps for finding the standard deviation by hand:

1. List each score and find their mean.
2. Subtract the mean from each score to get the deviation from the mean.
3. Square each of these deviations.
4. Add up all of the squared deviations.
5. Divide the sum of the squared deviations by n – 1 (for a sample) or N (for a
population).
6. Find the square root of the number you found.

Standard deviation example


Step 1: Data (minutes) Step 2: Deviation from mean Steps 3 + 4: Squared deviation

72 72 – 207.5 = -135.5 18360.25

110 110 – 207.5 = -97.5 9506.25

134 134 – 207.5 = -73.5 5402.25

190 190 – 207.5 = -17.5 306.25

238 238 – 207.5 = 30.5 930.25

287 287 – 207.5 = 79.5 6320.25

305 305 – 207.5 = 97.5 9506.25

324 324 – 207.5 = 116.5 13572.25

Mean = 207.5 Sum = 0 Sum of squares = 63904

Step 5: 63904/7 = 9129.14

Step 6: √9129.14 = 95.55
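
The six steps for the Sample A data can be followed line by line in Python. This is a minimal sketch that treats the 8 values as a sample (dividing by n – 1); the variable names are illustrative, and the last line cross-checks the result against the standard library's `statistics.stdev`.

```python
import math
import statistics

minutes = [72, 110, 134, 190, 238, 287, 305, 324]

mean = sum(minutes) / len(minutes)                      # Step 1
deviations = [x - mean for x in minutes]                # Step 2
squared = [d ** 2 for d in deviations]                  # Step 3
sum_of_squares = sum(squared)                           # Step 4
sample_variance = sum_of_squares / (len(minutes) - 1)   # Step 5 (sample: n - 1)
sample_sd = math.sqrt(sample_variance)                  # Step 6

print(mean, sum_of_squares)  # 207.5 63904.0
print(round(sample_variance, 2), round(sample_sd, 2))
# Cross-check against the standard library's sample standard deviation.
print(math.isclose(sample_sd, statistics.stdev(minutes)))
```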

Standard deviation formula for populations


If you have data from the entire population, use the population standard deviation
formula:

σ = √( Σ(X – μ)² / N )

• σ = population standard deviation
• Σ = sum of…
• X = each value
• μ = population mean
• N = number of values in the population

Standard deviation formula for samples


If you have data from a sample, use the sample standard deviation formula:

s = √( Σ(X – x̅)² / (n – 1) )

• s = sample standard deviation
• Σ = sum of…
• X = each value
• x̅ = sample mean
• n = number of values in the sample

Why use n – 1 for sample standard deviation?


Samples are used to make statistical inferences about the population that they came
from.

When you have population data, you can get an exact value for population standard
deviation. Since you collect data from every population member, the standard
deviation reflects the precise amount of variability in your distribution, the population.

But when you use sample data, your sample standard deviation is always used as
an estimate of the population standard deviation. Using n in this formula tends to
give you a biased estimate that consistently underestimates variability.

Reducing the sample n to n – 1 makes the standard deviation artificially large, giving
you a conservative estimate of variability.

While this is not an unbiased estimate, it is a less biased estimate of standard
deviation: it is better to overestimate rather than underestimate variability in samples.

The difference between biased and conservative estimates of standard deviation
gets much smaller when you have a large sample size.

Variance
The variance is the average of squared deviations from the mean. A deviation from
the mean is how far a score lies from the mean.

Variance is the square of the standard deviation. This means that variance is
expressed in squared units (e.g., minutes squared), so its values are usually much
larger than a typical value of the data set.

While it’s harder to interpret the variance number intuitively, it’s important to calculate
variance for comparing different data sets in statistical tests like ANOVAs.

Variance reflects the degree of spread in the data set. The more spread the data, the
larger the variance is in relation to the mean.

Variance example
To get variance, square the standard deviation (using its unrounded value).

s = √(63904 ÷ 7) ≈ 95.55

s² = 63904 ÷ 7 ≈ 9129.14

The variance of your data is 9129.14.


To find the variance by hand, perform all of the steps for standard deviation except
for the final step.
Variance for populations
σ² = Σ(X – μ)² / N

• σ² = population variance
• Σ = sum of…
• X = each value
• μ = population mean
• N = number of values in the population

Variance for samples


s² = Σ(X – x̅)² / (n – 1)

• s² = sample variance
• Σ = sum of…
• X = each value
• x̅ = sample mean
• n = number of values in the sample

Biased versus unbiased estimates of variance


An unbiased estimate in statistics is one that doesn’t consistently give you either
high values or low values – it has no systematic bias.

Just like for standard deviation, there are different formulas for population and
sample variance. But while there is no unbiased estimate for standard deviation,
there is one for sample variance.

If the sample variance formula used the sample n, the sample variance would be
biased towards lower numbers than expected. Reducing the sample n to n – 1
makes the variance artificially larger.

In this case, bias is not only lowered but totally removed. The sample variance
formula gives completely unbiased estimates of variance.

So why isn’t the sample standard deviation also an unbiased estimate?

That’s because sample standard deviation comes from finding the square root of
sample variance. Since a square root isn’t a linear operation, like addition or
subtraction, the unbiasedness of the sample variance formula isn’t carried over to the
sample standard deviation formula.
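
These claims can be checked exactly on a toy case by listing every possible sample instead of simulating. The sketch below assumes a hypothetical population of four values and draws all ordered samples of size 2 with replacement, a setting in which the n – 1 variance formula is exactly unbiased; none of these numbers come from the article.

```python
import itertools
import math
import statistics

# Hypothetical tiny population, chosen only to keep the arithmetic small.
population = [2.0, 4.0, 6.0, 8.0]
pop_variance = statistics.pvariance(population)   # divides by N
pop_sd = math.sqrt(pop_variance)

# Every possible ordered sample of size 2, drawn with replacement.
samples = list(itertools.product(population, repeat=2))

def average(values):
    values = list(values)
    return sum(values) / len(values)

mean_var_n_minus_1 = average(statistics.variance(s) for s in samples)  # divides by n - 1
mean_var_n = average(statistics.pvariance(s) for s in samples)         # divides by n
mean_sd_n_minus_1 = average(statistics.stdev(s) for s in samples)

print(pop_variance, mean_var_n_minus_1, mean_var_n)
print(round(pop_sd, 3), round(mean_sd_n_minus_1, 3))
```

Averaged over all samples, the n – 1 variance equals the population variance and the n version falls short, while the average sample standard deviation still comes out below the population standard deviation.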

What’s the best measure of variability?


The best measure of variability depends on your level of measurement and
distribution.
Level of measurement
For data measured at an ordinal level, the range and interquartile range are the only
appropriate measures of variability.

For more complex interval and ratio levels, the standard deviation and variance are
also applicable.

Distribution
For normal distributions, all measures can be used. The standard deviation and
variance are preferred because they take your whole data set into account, but this
also means that they are easily influenced by outliers.

For skewed distributions or data sets with outliers, the interquartile range is the best
measure. It’s least affected by extreme values because it focuses on the spread in
the middle of the data set.

Range
The range gives you an idea of how far apart the most extreme response scores are. To find
the range, simply subtract the lowest value from the highest value.

The range is a simple measure that tells you the spread of values in a data set. It
has a simple definition:
Range = maximum value – minimum value
So if you have a set of data such as 4, 2, 5, 8, 12, 15, the range is the highest
number (15) minus the lowest number (2). In this case:
Range = 15-2 = 13
This example tells you that the data set spans 13 numbers. In a box and whisker
plot, the ends of the whiskers give you a visual indication of the range, because
they mark the minimum and maximum values. A large range suggests a wide
spread of results, and a small range suggests data that is closely centered around a
specific value.

Range of visits to the library in the past year
Ordered data set: 0, 3, 3, 12, 15, 24
Range: 24 – 0 = 24

Standard deviation
The standard deviation (s or SD) is the average amount of variability in your dataset. It tells
you, on average, how far each score lies from the mean. The larger the standard deviation, the
more variable the data set is.

A high standard deviation means that values are generally far from the mean, while a
low standard deviation indicates that values are clustered close to the mean.

What does standard deviation tell you?


Standard deviation is a useful measure of spread for normal distributions.

In normal distributions, data is symmetrically distributed with no skew. Most values
cluster around a central region, with values tapering off as they go further away from
the center. The standard deviation tells you how spread out from the center of the
distribution your data is on average.

Many scientific variables follow normal distributions, including height, standardized
test scores, or job satisfaction ratings. When you have the standard deviations of
different samples, you can compare their distributions using statistical tests to make
inferences about the larger populations they came from.

Example: Comparing different standard deviations
You collect data on job satisfaction ratings from three groups of employees using simple random sampling.
The mean (M) ratings are the same for each group – it’s the value on the x-axis when the
curve is at its peak. However, their standard deviations (SD) differ from each other.

The standard deviation reflects the dispersion of the distribution. The curve with the lowest
standard deviation has a high peak and a small spread, while the curve with the highest
standard deviation is more flat and widespread.
The empirical rule
The standard deviation and the mean together can tell you where most of the values
in your frequency distribution lie if they follow a normal distribution.

The empirical rule, or the 68-95-99.7 rule, tells you where your values lie:

• Around 68% of scores are within 1 standard deviation of the mean,
• Around 95% of scores are within 2 standard deviations of the mean,
• Around 99.7% of scores are within 3 standard deviations of the mean.

Example: Standard deviation in a normal distribution
You administer a memory recall test to a group of students. The data follows a normal
distribution with a mean score of 50 and a standard deviation of 10.
Following the empirical rule:

• Around 68% of scores are between 40 and 60.
• Around 95% of scores are between 30 and 70.
• Around 99.7% of scores are between 20 and 80.
The empirical rule is a quick way to get an overview of your data and check for
any outliers or extreme values that don’t follow this pattern.
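
The intervals in the memory test example can be computed directly, and the percentages can be checked against an idealized normal curve. This is a sketch using `statistics.NormalDist` from the Python standard library (3.8 or later); it assumes the scores are exactly normal, which real data only approximates.

```python
from statistics import NormalDist

mean, sd = 50, 10   # memory recall test from the example
scores = NormalDist(mu=mean, sigma=sd)

for k in (1, 2, 3):
    low, high = mean - k * sd, mean + k * sd
    # Share of an ideal normal distribution lying between low and high.
    share = scores.cdf(high) - scores.cdf(low)
    print(f"within {k} SD: {low} to {high}, about {share:.1%} of scores")
```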

Note: For non-normal distributions, the standard deviation is a less reliable measure of
variability and should be used in combination with other measures like
the range or interquartile range.

Standard deviation formulas for populations and samples
Different formulas are used for calculating standard deviations depending on
whether you have collected data from a whole population or a sample.

Population standard deviation


When you have collected data from every member of the population that you’re
interested in, you can get an exact value for population standard deviation.

The population standard deviation formula looks like this:


σ = √( Σ(X – μ)² / N )

• σ = population standard deviation
• Σ = sum of…
• X = each value
• μ = population mean
• N = number of values in the population

Sample standard deviation


When you collect data from a sample, the sample standard deviation is used
to make estimates or inferences about the population standard deviation.

The sample standard deviation formula looks like this:

s = √( Σ(X – x̅)² / (n – 1) )

• s = sample standard deviation
• Σ = sum of…
• X = each value
• x̅ = sample mean
• n = number of values in the sample

With samples, we use n – 1 in the formula because using n would give us a biased
estimate that consistently underestimates variability. The sample standard deviation
would tend to be lower than the real standard deviation of the population.

Reducing the sample n to n – 1 makes the standard deviation artificially large, giving
you a conservative estimate of variability.

While this is not an unbiased estimate, it is a less biased estimate of standard
deviation: it is better to overestimate rather than underestimate variability in samples.

Steps for calculating the standard deviation by hand


The standard deviation is usually calculated automatically by whichever software you
use for your statistical analysis. But you can also calculate it by hand to better
understand how the formula works.

There are six main steps for finding the standard deviation by hand. We’ll use a
small data set of 6 scores to walk through the steps.

Data set

46 69 32 60 52 41

Step 1: Find the mean
To find the mean, add up all the scores, then divide them by the number of scores.

Mean (x̅)

x̅ = (46 + 69 + 32 + 60 + 52 + 41) ÷ 6 = 50

Step 2: Find each score’s deviation from the mean


Subtract the mean from each score to get the deviations from the mean.

Since x̅ = 50, here we take away 50 from each score.

Score Deviation from the mean

46 46 – 50 = -4

69 69 – 50 = 19

32 32 – 50 = -18

60 60 – 50 = 10

52 52 – 50 = 2

41 41 – 50 = -9

Step 3: Square each deviation from the mean


Multiply each deviation from the mean by itself. This will result in positive numbers.

Squared deviations from the mean

(-4)² = -4 × -4 = 16

19² = 19 × 19 = 361

(-18)² = -18 × -18 = 324

10² = 10 × 10 = 100

2² = 2 × 2 = 4

(-9)² = -9 × -9 = 81

Step 4: Find the sum of squares


Add up all of the squared deviations. This is called the sum of squares.

Sum of squares

16 + 361 + 324 + 100 + 4 + 81


= 886

Step 5: Find the variance


Divide the sum of the squares by n – 1 (for a sample) or N (for a population) – this is
the variance.

Since we’re working with a sample size of 6, we will use n – 1, where n = 6.

Variance

886 ÷ (6 – 1) = 177.2

Step 6: Find the square root of the variance


To find the standard deviation, we take the square root of the variance.

Standard deviation

√177.2 = 13.31

From learning that SD = 13.31, we can say that each score deviates from the mean
by 13.31 points on average.
Why is standard deviation a useful measure of
variability?
Although there are simpler ways to calculate variability, the standard deviation
formula weighs unevenly spread out samples more than evenly spread samples. A
higher standard deviation tells you that the distribution is not only more spread out,
but also more unevenly spread out.

This means it gives you a better idea of your data’s variability than simpler
measures, such as the mean absolute deviation (MAD).

The MAD is similar to standard deviation but easier to calculate. First, you express
each deviation from the mean in absolute values by converting them into positive
numbers (for example, -3 becomes 3). Then, you calculate the mean of these
absolute deviations.

Unlike the standard deviation, you don’t have to calculate squares or square roots of
numbers for the MAD. However, for that reason, it gives you a less precise measure
of variability.

Let’s take two samples with the same central tendency but different amounts of
variability. Sample B is more variable than Sample A.

Values Mean Mean absolute deviation Standard deviation

Sample A 66, 30, 40, 64 50 15 17.8

Sample B 51, 21, 79, 49 50 15 23.7

For samples with equal average deviations from the mean, the MAD can’t
differentiate levels of spread. The standard deviation is more precise: it is higher for
the sample with more variability in deviations from the mean.

By squaring the differences from the mean, standard deviation reflects uneven
dispersion more accurately. This step weighs extreme deviations more heavily than
small deviations.

However, this also makes the standard deviation sensitive to outliers.
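
The comparison between the two samples can be reproduced as follows. This is a minimal sketch; `mean_absolute_deviation` is an illustrative helper (the standard library has no function by that name), and the standard deviation uses the sample formula with n – 1.

```python
import statistics

def mean_absolute_deviation(values):
    """Mean of the absolute deviations from the mean."""
    mean = statistics.mean(values)
    return statistics.mean([abs(x - mean) for x in values])

samples = {
    "Sample A": [66, 30, 40, 64],
    "Sample B": [51, 21, 79, 49],
}

for name, values in samples.items():
    mad = mean_absolute_deviation(values)
    sd = statistics.stdev(values)   # sample standard deviation (n - 1)
    print(name, mad, round(sd, 1))
```

Both samples have the same mean absolute deviation, but squaring gives the two large deviations in Sample B more weight, so its standard deviation is higher.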

There are six steps for finding the standard deviation:

1. List each score and find their mean.
2. Subtract the mean from each score to get the deviation from the mean.
3. Square each of these deviations.
4. Add up all of the squared deviations.
5. Divide the sum of the squared deviations by N – 1.
6. Find the square root of the number you found.

Standard deviations of visits to the library in the past year
In the table below, you complete Steps 1 through 4.

Raw data Deviation from mean Squared deviation

15 15 – 9.5 = 5.5 30.25

3 3 – 9.5 = -6.5 42.25

12 12 – 9.5 = 2.5 6.25

0 0 – 9.5 = -9.5 90.25

24 24 – 9.5 = 14.5 210.25

3 3 – 9.5 = -6.5 42.25

M = 9.5 Sum = 0 Sum of squares = 421.5

Step 5: 421.5/5 = 84.3

Step 6: √84.3 = 9.18

From learning that s = 9.18, you can say that on average, each score deviates from the mean
by 9.18 points.

Variance
The variance is the average of squared deviations from the mean. Variance reflects the degree
of spread in the data set. The more spread the data, the larger the variance is in relation to the
mean.

To find the variance, simply square the standard deviation. The symbol for variance is s².

Variance of visits to the library in the past year
Data set: 15, 3, 12, 0, 24, 3
s = 9.18

s² = 84.3
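
As a quick check, here is a minimal Python sketch using the standard library's `statistics` module. It also shows a rounding detail: squaring the rounded value 9.18 gives about 84.27, so the 84.3 above comes from squaring the unrounded standard deviation.

```python
import statistics

visits = [15, 3, 12, 0, 24, 3]

s = statistics.stdev(visits)             # sample standard deviation
s_squared = statistics.variance(visits)  # sample variance

print(round(s, 2))          # standard deviation, rounded for reporting
print(round(s_squared, 1))  # variance, rounded for reporting
print(round(9.18 ** 2, 2))  # squaring the already-rounded s loses a little precision
```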

The variance is a measure of variability. It is calculated by taking the average of
squared deviations from the mean.

Variance tells you the degree of spread in your data set. The more spread the data,
the larger the variance is in relation to the mean.

The standard deviation is derived from variance and tells you, on average, how far
each value lies from the mean. It’s the square root of variance.

Both measures reflect variability in a distribution, but their units differ:

• Standard deviation is expressed in the same units as the original values
(e.g., meters).
• Variance is expressed in squared units (e.g., meters squared).

Since variance is expressed in squared units rather than the units of the original
values, it’s harder to interpret the variance number intuitively. That’s why standard
deviation is often preferred as a main measure of variability.

However, the variance is more informative about variability than the standard
deviation, and it’s used in making statistical inferences.

Population vs. sample variance


Different formulas are used for calculating variance depending on whether you have
data from a whole population or a sample.

Population variance
When you have collected data from every member of the population that you’re
interested in, you can get an exact value for population variance.

The population variance formula looks like this:

Formula: $\sigma^2 = \dfrac{\sum (X - \mu)^2}{N}$

Explanation:
• $\sigma^2$ = population variance
• $\sum$ = sum of…
• $X$ = each value
• $\mu$ = population mean
• $N$ = number of values in the population

Sample variance
When you collect data from a sample, the sample variance is used to make
estimates or inferences about the population variance.

The sample variance formula looks like this:

Formula: $s^2 = \dfrac{\sum (X - \bar{x})^2}{n - 1}$

Explanation:
• $s^2$ = sample variance
• $\sum$ = sum of…
• $X$ = each value
• $\bar{x}$ = sample mean
• $n$ = number of values in the sample

With samples, we use n – 1 in the formula because using n would give us a biased
estimate that consistently underestimates variability. The sample variance would
tend to be lower than the real variance of the population.

Dividing by n – 1 instead of n makes the sample variance slightly larger, which
corrects for this underestimation and gives you an unbiased estimate of the
population variance.
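
One way to see the bias is to enumerate every possible sample from a small population and average the resulting variances. The sketch below is an illustration under stated assumptions: it treats the six library-visit values as a whole population, draws every ordered sample of size 2 with replacement, and uses a helper `variance(values, ddof)` that we define here (it is not a library function).

```python
from fractions import Fraction
from itertools import product

def variance(values, ddof):
    """Sum of squared deviations divided by (len(values) - ddof)."""
    n = len(values)
    mean = Fraction(sum(values), n)
    return sum((x - mean) ** 2 for x in values) / (n - ddof)

population = [15, 3, 12, 0, 24, 3]      # treated here as a whole population
pop_var = variance(population, ddof=0)  # population variance: divide by N

# Every possible ordered sample of size 2, drawn with replacement.
samples = list(product(population, repeat=2))
avg_div_n = sum(variance(s, ddof=0) for s in samples) / len(samples)
avg_div_n_minus_1 = sum(variance(s, ddof=1) for s in samples) / len(samples)

print(float(pop_var), float(avg_div_n), float(avg_div_n_minus_1))
```

Averaged over all samples, dividing by n falls short of the population variance, while dividing by n – 1 matches it exactly. Exact fractions are used so the comparison is not blurred by floating-point rounding.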

It’s important to note that doing the same thing with the standard deviation formulas
doesn’t lead to completely unbiased estimates. Since a square root isn’t a linear
operation, like addition or subtraction, the unbiasedness of the sample variance
formula doesn’t carry over to the sample standard deviation formula.
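
A small enumeration illustrates this point. As an assumption for the sketch, the six library-visit values are treated as a whole population and every ordered sample of size 2 is drawn with replacement; `sample_sd` is a helper name of our own.

```python
import math
from itertools import product

def sample_sd(values):
    """Sample standard deviation: square root of the n - 1 variance."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))

population = [15, 3, 12, 0, 24, 3]  # treated here as a whole population
pop_mean = sum(population) / len(population)
pop_sd = math.sqrt(sum((x - pop_mean) ** 2 for x in population) / len(population))

# Average the sample standard deviation over every possible sample of size 2.
samples = list(product(population, repeat=2))
avg_sample_sd = sum(sample_sd(s) for s in samples) / len(samples)

print(round(pop_sd, 2), round(avg_sample_sd, 2))
```

Even with n – 1 in the denominator, the average sample standard deviation comes out below the population standard deviation.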

Steps for calculating the variance by hand


The variance is usually calculated automatically by whichever software you use for
your statistical analysis. But you can also calculate it by hand to better understand
how the formula works.

There are five main steps for finding the variance by hand. We’ll use a small data set
of 6 scores to walk through the steps.

Data set

46   69   32   60   52   41

Step 1: Find the mean


To find the mean, add up all the scores, then divide them by the number of scores.

Mean (x̅)

= (46 + 69 + 32 + 60 + 52 + 41) ÷ 6
= 300 ÷ 6
= 50
Step 2: Find each score’s deviation from the mean
Subtract the mean from each score to get the deviations from the mean.

Since x̅ = 50, take away 50 from each score.

Score Deviation from the mean

46 46 – 50 = -4

69 69 – 50 = 19

32 32 – 50 = -18

60 60 – 50 = 10

52 52 – 50 = 2

41 41 – 50 = -9

Step 3: Square each deviation from the mean


Multiply each deviation from the mean by itself. This will result in positive numbers.

Squared deviations from the mean

(-4)² = -4 × -4 = 16

19² = 19 × 19 = 361

(-18)² = -18 × -18 = 324

10² = 10 × 10 = 100

2² = 2 × 2 = 4

(-9)² = -9 × -9 = 81

Step 4: Find the sum of squares


Add up all of the squared deviations. This is called the sum of squares.
Sum of squares

16 + 361 + 324 + 100 + 4 + 81


= 886

Step 5: Divide the sum of squares by n – 1 or N


Divide the sum of the squares by n – 1 (for a sample) or N (for a population).

Since we’re working with a sample, we’ll use n – 1, where n = 6.

Variance

886 ÷ (6 – 1) = 886 ÷ 5
= 177.2
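
The five steps above can be sketched in Python. This mirrors the hand calculation for the six scores and also shows the population version (dividing by N) for comparison; the variable names are illustrative.

```python
import statistics

scores = [46, 69, 32, 60, 52, 41]

mean = sum(scores) / len(scores)          # Step 1: find the mean
deviations = [x - mean for x in scores]   # Step 2: deviations from the mean
squared = [d ** 2 for d in deviations]    # Step 3: square each deviation
sum_of_squares = sum(squared)             # Step 4: sum of squares

# Step 5: divide by n - 1 for a sample, or by N for a population.
sample_variance = sum_of_squares / (len(scores) - 1)
population_variance = sum_of_squares / len(scores)

print(mean, sum_of_squares, sample_variance, round(population_variance, 2))

# The standard library gives the same sample variance.
print(statistics.variance(scores))
```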

Why does variance matter?


Variance matters for two main reasons:

• Parametric statistical tests are sensitive to variance.
• Comparing the variance of samples helps you assess group differences.

Homogeneity of variance in statistical tests


Variance is important to consider before performing parametric tests. These tests
require equal or similar variances, also called homogeneity of variance or
homoscedasticity, when comparing different samples.

Uneven variances between samples result in biased and skewed test results. If you
have uneven variances across samples, non-parametric tests are more appropriate.

Using variance to assess group differences


Statistical tests like variance tests or the analysis of variance (ANOVA) use sample
variance to assess group differences. They use the variances of the samples to
assess whether the populations they come from differ from each other.

Research example
As an education researcher, you want to test the hypothesis that different
frequencies of quizzes lead to different final scores of college students. You collect the final
scores from three groups with 20 students each that had quizzes frequently, infrequently, or
rarely over a semester.

• Sample A: Once a week
• Sample B: Once every 3 weeks
• Sample C: Once every 6 weeks

To assess group differences, you perform an ANOVA.


The main idea behind an ANOVA is to compare the variances between groups and
variances within groups to see whether the results are best explained by the group
differences or by individual differences.

If there’s higher between-group variance relative to within-group variance, then the
groups are likely to be different as a result of your treatment. If not, then the results
may come from individual differences of sample members instead.

Research example
Your ANOVA assesses whether the differences in mean final scores
between groups come from the differences in the frequency of quizzes or the individual
differences of the students in each group.
To do so, you get a ratio of the between-group variance of final scores and the within-group
variance of final scores – this is the F-statistic. With a large F-statistic, you find the
corresponding p-value, and conclude that the groups are significantly different from each
other.
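
To make the ratio concrete, here is a minimal sketch of a one-way ANOVA F-statistic computed by hand in Python. The scores are hypothetical (three made-up students per group, not the 20 per group in the example), and `one_way_f` is a helper name of our own. It stops at the F-statistic and does not compute the p-value.

```python
def one_way_f(groups):
    """F-statistic: between-group variance divided by within-group variance."""
    all_scores = [x for g in groups for x in g]
    grand_mean = sum(all_scores) / len(all_scores)
    k, n_total = len(groups), len(all_scores)

    # Between-group sum of squares: group means around the grand mean.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: scores around their own group mean.
    ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)

    ms_between = ss_between / (k - 1)       # between-group variance
    ms_within = ss_within / (n_total - k)   # within-group variance
    return ms_between / ms_within

# Hypothetical final scores for illustration only (not from the example).
sample_a = [80, 85, 90]  # quizzes once a week
sample_b = [70, 75, 80]  # quizzes once every 3 weeks
sample_c = [60, 65, 70]  # quizzes once every 6 weeks

f_stat = one_way_f([sample_a, sample_b, sample_c])
print(f_stat)
```

With these made-up numbers, the between-group variance is 300 and the within-group variance is 25, so the F-statistic is 12.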

Univariate descriptive statistics


Univariate descriptive statistics focus on only one variable at a time. It’s important to
examine data from each variable separately using multiple measures of distribution, central
tendency and spread. Programs like SPSS and Excel can be used to easily calculate these.

Visits to the library

N                    6
Mean                 9.5
Median               7.5
Mode                 3
Standard deviation   9.18
Variance             84.3
Range                24

If you were to only consider the mean as a measure of central tendency, your impression of
the “middle” of the data set can be skewed by outliers, unlike the median or mode.

Likewise, while the range is sensitive to outliers, you should also consider the standard
deviation and variance to get easily comparable measures of spread.
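
Besides SPSS and Excel, the same summary can be produced with Python's standard `statistics` module. The sketch below assumes the library-visit data set used earlier (15, 3, 12, 0, 24, 3) and the sample (n – 1) versions of the standard deviation and variance.

```python
import statistics

visits = [15, 3, 12, 0, 24, 3]

summary = {
    "N": len(visits),
    "Mean": statistics.mean(visits),
    "Median": statistics.median(visits),
    "Mode": statistics.mode(visits),
    "Standard deviation": round(statistics.stdev(visits), 2),
    "Variance": round(statistics.variance(visits), 1),
    "Range": max(visits) - min(visits),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```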

Bivariate descriptive statistics


If you’ve collected data on more than one variable, you can use bivariate or multivariate
descriptive statistics to explore whether there are relationships between them.

In bivariate analysis, you simultaneously study the frequency and variability of
two variables to see if they vary together. You can also compare the central tendency of the
two variables before performing further statistical tests.

Multivariate analysis is the same as bivariate analysis but with more than two variables.
Contingency table
In a contingency table, each cell represents the intersection of two variables. Usually,
an independent variable (e.g., gender) appears along the vertical axis and a dependent one
appears along the horizontal axis (e.g., activities). You read “across” the table to see how the
independent and dependent variables relate to each other.

Number of visits to the library in the past year

Group      0–4   5–8   9–12   13–16   17+
Children   32    68    37     23      22
Adults     36    48    43     83      25

Interpreting a contingency table is easier when the raw data is converted to percentages.
Percentages make each row comparable to the other by making it seem as if each group had
only 100 observations or participants. When creating a percentage-based contingency table,
you add the N for each independent variable on the end.

Visits to the library in the past year (Percentages)

Group      0–4   5–8   9–12   13–16   17+   N
Children   18%   37%   20%    13%     12%   182
Adults     15%   20%   18%    35%     11%   235

From this table, it is clearer that similar proportions of children and adults go to the
library over 17 times a year. Additionally, children most commonly went to the library
between 5 and 8 times, while for adults, this number was between 13 and 16.
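
The conversion from raw counts to row percentages can be sketched in Python. The counts are copied from the first contingency table; `row_percentages` is a helper name of our own, and percentages are rounded to whole numbers as in the table.

```python
# Counts copied from the contingency table of library visits.
bins = ["0-4", "5-8", "9-12", "13-16", "17+"]
counts = {
    "Children": [32, 68, 37, 23, 22],
    "Adults": [36, 48, 43, 83, 25],
}

def row_percentages(row):
    """Return each cell as a whole-number percentage of its row total, plus the total."""
    total = sum(row)
    return [round(100 * count / total) for count in row], total

for group, row in counts.items():
    percentages, n = row_percentages(row)
    most_common = bins[row.index(max(row))]  # bin with the highest count
    cells = "  ".join(f"{p}%" for p in percentages)
    print(f"{group}: {cells}  N={n}  (most common: {most_common})")
```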

Scatter plots
A scatter plot is a chart that shows you the relationship between two or three variables. It’s a
visual representation of the strength of a relationship.

In a scatter plot, you plot one variable along the x-axis and another one along the y-axis. Each
data point is represented by a point in the chart.

Scatter plot example: Library visits and movie theater visits
You investigate whether people
who visit the library more tend to watch a movie at a theater less. You plot the number of
times participants watched movies at a theater along the x-axis and visits to the library along
the y-axis.
From your scatter plot, you see that as the number of movies seen at movie theaters increases,
the number of visits to the library decreases. Based on your visual assessment of a possible
linear relationship, you perform further tests of correlation and regression.

How to Calculate Sample Size in Excel


Use the built-in functions, replacing the range with the cells containing the data. The
formulas below assume a set of data running from C2 to C101.

Maximum value: =MAX(C2:C101)
Minimum value: =MIN(C2:C101)
Mean: =AVERAGE(C2:C101)
Standard deviation: =STDEV(C2:C101)
N: =COUNT(C2:C101)
Range: =MAX(C2:C101)-MIN(C2:C101)

If your data is in a single row or column, type the letter and number of the first
data point after the opening parenthesis, then a colon followed by the letter and
number corresponding to the last data point, and then close the parentheses to return
the minimum value. You can also do this by clicking the first cell after
opening the parentheses and then holding down "Shift" and clicking the cell with
the last data point before closing the parentheses.
