Lesson 5 Notes

This document provides an overview of key concepts in data analysis, focusing on the differences between population and sample, confidence intervals, and data management techniques. It explains the importance of sampling methods and how confidence intervals can be used to estimate population parameters. Additionally, it highlights the significance of data cleaning in ensuring accurate analysis results.

Uploaded by

chandantavane99

Diploma in Data Analysis
Confidence in your Sample
Lesson 5: Summary Notes

DATA ANALYSIS

Contents
• Lesson 5 objectives
• Introduction
• Population and sample
• Confidence interval and limits
• Data management
• References


Lesson Objectives
• Sample and population
• Confidence interval
• Data management

Lesson Introduction
By the end of this lesson, you will know the
differences between a sample and a population and
why samples are so important to us. Along with
sampling, you will be introduced to confidence
intervals. We will end the lesson with
an interactive data management exercise on our
Titanic dataset, where I will show you a few more
tips and tricks to look out for.

Did you know?


Career opportunities in analytics have boomed over the last few years. A while ago,
McKinsey estimated that the United States alone faced a shortage of 140 000 - 190 000 employees
with analytics expertise. Almost every industry that uses data will require data
analysts to analyse it.


Population and sample


Imagine we want to gather information about people living in a country. Many governments conduct a census,
typically every 10 years, which collects information from every person living in that country. As you can imagine,
this can be very time and resource consuming.

What happens when we want to find information about people in the country quickly? This task would be difficult
with an extremely large group. Instead, we could conduct a smaller study based on a small proportion of the population.
This proportion of the population, chosen to be representative of the larger group, is known as a sample. We can use
this sample to make inferences about the general population of the country.

Population defined

A population can be defined as a large set, collection or group of items that all have something in
common. An example of a population could be all the students enrolled at Shaw Academy. This group contains
every student, and what the students have in common is that they are all enrolled at Shaw Academy.

Before we begin a study, it is important to know what our population is. We need to define the population to meet
the needs of our study. For example, if we want to conduct a study on birth weights between two given years, we need to
define the population as the birth weights of live-born infants in the United States between years x and y. But once
again, calculating the mean and variance for the whole population would be difficult for such a large group.

Sample defined

• A sample can be defined as a subset of the population that is selected to be representative of the larger
population. It is a set of data that is collected or selected from a population or larger group through a specific
procedure.
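• Using the Shaw Academy example, a sample could be a group of students selected from everyone enrolled. A subset such
as all students under the age of 25 is also a sample, although it would only be representative of that age group.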

It isn’t always practical to collect data from the entire population, so we take a sample from that population to deduce
information about the larger population. Collecting a smaller group of data is less time consuming and much
cheaper.

Sample size

It is important to consider the size of your sample given your population. For example, it is not sensible to sample 3
people when considering the entire population of a country.

If you choose a sample that is too small to represent your population, the confidence intervals might be very
wide, and we increase our risk of errors in statistical hypothesis tests.

It is not always possible to take a bigger sample due to resource constraints.

It is important to remember that larger sample sizes will generally estimate unknown population parameters
more accurately, so we select a sample size depending on the precision we want.
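
The link between sample size and interval width can be sketched in Python. This is a minimal illustration only: it assumes the normal-approximation margin of error z * sigma / sqrt(n), a made-up population standard deviation of 15, and a hypothetical helper named margin_of_error.

```python
import math
from statistics import NormalDist

def margin_of_error(sigma, n, confidence=0.95):
    """Half-width of a normal-approximation confidence interval for a mean."""
    # z is the critical value that leaves (1 - confidence) / 2 in each tail.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * sigma / math.sqrt(n)

sigma = 15  # assumed population standard deviation (illustrative value only)
for n in (3, 30, 300, 3000):
    print(f"n = {n:>4}: margin of error = ±{margin_of_error(sigma, n):.2f}")
```

Because the margin of error shrinks with the square root of n, a sample 100 times larger gives an interval about 10 times narrower.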


Methods of sampling

The process of collecting information from a sample is called sampling. We use sampling techniques because we
want to choose a sample that is a good representation of the population.

A few sampling techniques include random sampling, systematic sampling, stratified sampling, and cluster sampling.

Random sampling

• Random sampling (more precisely, simple random sampling) is a form of probability sampling in which a sample of
size n is chosen in such a way that every item in the population has an equal chance of being selected.
• A good example would be drawing names randomly from a hat.
• This is the most common way of sampling.
• With simple random sampling, we can use statistical methods to analyse the sample results.
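
Drawing names from a hat can be sketched in Python with the standard library. The population of 1800 student ID numbers and the sample size of 180 are assumptions made for illustration.

```python
import random

# Hypothetical population: 1800 student ID numbers.
population = list(range(1, 1801))

rng = random.Random(42)  # fixed seed so the draw can be repeated
sample = rng.sample(population, k=180)  # draws without replacement

print(len(sample), sample[:5])
```

Because random.Random.sample draws without replacement, no student can appear in the sample twice.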

Systematic sampling

• Systematic sampling is when the population is ordered, and items are chosen at regular intervals such as every
5th item in the population.
• The starting point, however, is randomly generated.
• An example of this could be choosing every 10th name in the class starting from the 23rd name in the list.
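
A minimal sketch of systematic sampling, assuming a hypothetical class list of 100 names and the common convention of picking the random starting point within the first interval. The function name systematic_sample is made up for this example.

```python
import random

def systematic_sample(items, step, rng=random):
    """Pick every `step`-th item after a randomly chosen starting position."""
    start = rng.randrange(step)  # random start within the first `step` items
    return items[start::step]

names = [f"student_{i}" for i in range(1, 101)]  # hypothetical class list
sample = systematic_sample(names, step=10, rng=random.Random(1))
print(sample)
```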

Stratified sampling

• Stratified sampling is when the population is divided into subgroups, also referred to as strata, based on certain
common characteristics, like gender, age, or income bracket.
• For example, if the class has 1500 female students attending but only 300 male students and we want to ensure
the sample is an accurate reflection of the population, we could choose a random sample of each group by
selecting 150 female students and 30 male students to be a part of our sample.
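
The class example above can be sketched as follows. The student labels are invented, and stratified_sample is a hypothetical helper that draws the same fraction (here 10%) at random from each stratum.

```python
import random

def stratified_sample(strata, fraction, rng=random):
    """Draw the same fraction at random from every stratum."""
    return {
        name: rng.sample(members, k=round(len(members) * fraction))
        for name, members in strata.items()
    }

# Hypothetical class: 1500 female and 300 male students, as in the example.
strata = {
    "female": [f"F{i}" for i in range(1500)],
    "male": [f"M{i}" for i in range(300)],
}
sample = stratified_sample(strata, fraction=0.10, rng=random.Random(7))
print({name: len(members) for name, members in sample.items()})
```

Sampling the same fraction from each group keeps the 5:1 ratio of female to male students in the sample.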

Cluster sampling

• Cluster sampling is when the population is divided into clusters or subgroups.
• The clusters all have similar characteristics to one another.
• Some of the clusters are then randomly selected, and all the items in each chosen cluster are included in the sample.
• This is a good technique to use for large populations that are extremely diverse.
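
A minimal sketch of cluster sampling, assuming a made-up population of 20 schools with 50 students each. The helper cluster_sample is hypothetical: it picks whole clusters at random and keeps everyone in them.

```python
import random

def cluster_sample(clusters, n_clusters, rng=random):
    """Randomly choose whole clusters and keep every item inside them."""
    chosen = rng.sample(sorted(clusters), k=n_clusters)
    return [item for name in chosen for item in clusters[name]]

# Hypothetical population: 20 schools (clusters) with 50 students each.
clusters = {
    f"school_{s}": [f"school_{s}_student_{i}" for i in range(50)]
    for s in range(20)
}
sample = cluster_sample(clusters, n_clusters=4, rng=random.Random(3))
print(len(sample))
```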


Confidence intervals and limits


Suppose you wanted to know the average number of chips you get when you order medium fries at McDonald’s. If you
ordered a few portions, tallied the total number of chips and divided it by the number of portions ordered, you would have
obtained what is called a point estimate of the true mean. We can use sample data to make observations about
an unknown population. This is called inferential statistics, where we use sample data to estimate a
population parameter. We know the estimate isn’t going to be the exact value, but we can get close to it. From point
estimates we can construct interval estimates known as confidence intervals. A confidence interval
provides us with a range of values that is likely to contain the population parameter we are interested in.

Confidence intervals defined

A confidence interval is a range of values, calculated from sample data, that is constructed so that it would contain a
population parameter (like the mean) in a stated proportion of repeated samples. Confidence intervals are used to
measure the uncertainty of an estimate obtained by a sampling method. The most common confidence levels are 95% and
99%. For instance, we might say we are 95% confident that the mean lies between the two calculated values. If we sampled
the same population many times and calculated a point estimate and a confidence interval on each occasion, the intervals
would bracket the true population parameter in approximately 95% of the cases.

A confidence interval, therefore, is a range of values that is likely to contain the unknown population parameter. The
interval tells us how confident we can be that the population parameter falls within a specific range.
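
The repeated-sampling idea behind a 95% confidence level can be checked with a small simulation. This is a sketch under stated assumptions: the population is simulated as normal with a mean of 70 and a standard deviation of 10 (invented values), each sample has 100 observations, and the interval uses the normal approximation. The helper confidence_interval is hypothetical.

```python
import random
from statistics import NormalDist, mean, stdev

def confidence_interval(sample, confidence=0.95):
    """Normal-approximation confidence interval for the mean of `sample`."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    centre = mean(sample)
    half_width = z * stdev(sample) / len(sample) ** 0.5
    return centre - half_width, centre + half_width

rng = random.Random(0)
true_mean, true_sd = 70, 10  # assumed population values, for illustration only
trials = 2000
hits = 0
for _ in range(trials):
    sample = [rng.gauss(true_mean, true_sd) for _ in range(100)]
    low, high = confidence_interval(sample)
    hits += low <= true_mean <= high  # True counts as 1

print(f"Intervals containing the true mean: {hits / trials:.1%}")
```

The printed share should land close to 95%, which is what the confidence level describes: the long-run success rate of the method, not a property of any single interval.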

Did you know: In the vernacular

You might have heard the phrase “we are 95% certain that group A falls within these ranges”. In the vernacular, this
statement is just another way of saying that the population parameter of interest is contained within the 95%
confidence interval.

Interpreting confidence intervals

Let’s use an example to illustrate how we interpret a 95% confidence interval on a normal distribution.
Suppose we have a flock of female sheep happily grazing in a field in France. We want to know what the mean weight
of the sheep is. We could take a random sample of 100 sheep from the population and establish a mean weight of
70kg. The mean of 70kg is a point estimate of the population mean (the mean weight of the entire flock).

We need to know how far this point estimate from the sample might be from the true mean of the population. Therefore,
we establish a 95% confidence interval using the sample mean and standard deviation and assuming a normal
distribution. Suppose we find our interval to be 60 - 80kg. The method used to arrive at this upper and lower bound
produces intervals that contain the true mean 95% of the time.

This means that if we took 100 random samples from our population of sheep and calculated an interval from each,
about 95 of those intervals would be expected to contain the true mean.

Suppose we want to be able to say that we are 99% confident that the interval contains the true mean, meaning
we want greater confidence and move from a 95% to a 99% confidence interval.


To do so, our confidence interval will broaden to cover a wider range of possible values for the mean. For instance,
where our previous confidence interval was 60-80kg, the new confidence interval might expand to roughly
57-83kg (for argument’s sake). Now, about 99 out of 100 intervals built in this way will contain the true
mean.
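
How much an interval widens when moving from 95% to 99% confidence can be worked out from the sheep example. The sketch below takes the sample mean of 70kg and the 95% interval of 60-80kg from the text, assumes the normal approximation, and backs out the standard error that those numbers imply. The helper name interval is hypothetical.

```python
from statistics import NormalDist

def interval(centre, standard_error, confidence):
    """Normal-approximation interval: centre ± z * standard error."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return centre - z * standard_error, centre + z * standard_error

sample_mean = 70.0
# Standard error implied by a 95% interval of 60-80kg (half-width 10kg).
standard_error = 10.0 / NormalDist().inv_cdf(0.975)

for confidence in (0.95, 0.99):
    low, high = interval(sample_mean, standard_error, confidence)
    print(f"{confidence:.0%} interval: {low:.1f}kg to {high:.1f}kg")
```

Under these assumptions the interval stays centred on the sample mean and widens equally on both sides as the confidence level rises.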

Misconceptions/misunderstandings about confidence intervals

Confidence intervals are often misinterpreted. A confidence interval is not the percentage of data from a given
sample that falls within the interval. When we draw a sample from our flock of sheep and obtain a 95%
confidence interval of 60 - 80kg, we cannot say that 95% of the weights in the random sample fall between the upper
and lower bounds of the interval. To describe where 95% of the individual weights fall, we would need to use the sample
mean and standard deviation and be able to identify the sample distribution as normal.
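
The difference between an interval for the mean and a range covering most of the data can be seen in a short sketch. The 100 sheep weights below are simulated from a normal distribution with a mean of 70kg and a standard deviation of 10kg, values chosen purely for illustration.

```python
import random
from statistics import NormalDist, mean, stdev

rng = random.Random(1)
# Hypothetical sample of 100 sheep weights (kg); the values are simulated.
weights = [rng.gauss(70, 10) for _ in range(100)]

z = NormalDist().inv_cdf(0.975)
centre, spread = mean(weights), stdev(weights)

# 95% confidence interval for the MEAN: uses the standard error.
ci_low = centre - z * spread / len(weights) ** 0.5
ci_high = centre + z * spread / len(weights) ** 0.5

# Range expected to cover about 95% of individual WEIGHTS: uses the
# standard deviation itself and relies on the data being roughly normal.
data_low, data_high = centre - z * spread, centre + z * spread

def share_inside(low, high):
    """Fraction of the sampled weights that fall between low and high."""
    return sum(low <= w <= high for w in weights) / len(weights)

print(f"Share of weights inside the confidence interval: {share_inside(ci_low, ci_high):.0%}")
print(f"Share of weights inside mean ± 1.96 SD: {share_inside(data_low, data_high):.0%}")
```

Only a small share of the individual weights falls inside the confidence interval for the mean, while the range built from the standard deviation covers most of them.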


Data management

Did you know?


Did you know that analytics professionals can spend up to 60% of their time on data management? Data cleaning
is a foundational step in the data analysis process. The step is important to ensure that the results we generate are
accurate. When data is drawn from many different sources, it often carries mistakes.

Data management

Data management is important because it optimizes the way that we work and ensures that the data is accurate, which in
turn can increase revenue and efficiency in the working environment.

Data cleaning involves removing information that is not needed, updating information that is incomplete or
incorrect, and checking whether there are any duplicate variables, to name but a few of the steps.

Transform a variable

1. Decimal values read in with a point as the decimal separator can have the point replaced with a comma.


2. Data type can be changed from, for example, “Whole Number” to “Decimal Number”.

3. Null values can be removed from a variable by selecting the drop-down and deselecting “(null)”.
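
The same three clean-up steps can be sketched in code. This is not the tool used in the lesson; it is a plain Python illustration with an invented column of fare values. Python expects a point as the decimal separator, so the replacement here runs from comma to point, and to_decimal is a hypothetical helper.

```python
def to_decimal(text, decimal_separator=","):
    """Convert a text value to a float, or None if it is missing."""
    if text is None or text.strip() in ("", "null"):
        return None
    return float(text.strip().replace(decimal_separator, "."))

# Hypothetical raw column as it might arrive from a text file.
raw_fares = ["7,25", "71,2833", None, "8,05", "null", "53"]

fares = [to_decimal(value) for value in raw_fares]          # steps 1 and 2
clean_fares = [fare for fare in fares if fare is not None]  # step 3

print(clean_fares)  # [7.25, 71.2833, 8.05, 53.0]
```

The whole number "53" becomes the decimal 53.0, and the two missing entries are dropped.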


Resources
Glen, S., 2013, What is a Population in Statistics?, Statistics How To, https://www.statisticshowto.com/what-is-a-population/
365 Data Science, Statistics - Population vs sample, https://365datascience.com/explainer-video/population-vs-sample/
Engineering Statistics Handbook, 2013, 7.1.4. What are confidence intervals?, https://www.itl.nist.gov/div898/handbook/prc/section1/prc14.htm
Kenton, W., 2020, Confidence Intervals, Investopedia, https://www.investopedia.com/terms/c/confidenceinterval.asp#:~:text=A%20confidence%20interval%2C%20in%20statistics,certainty%20in%20a%20sampling%20method.
Yale University: Department of Statistics, 1997, Confidence Intervals, http://www.stat.yale.edu/Courses/1997-98/101/confint.htm
Introduction to Confidence Intervals, Lumen course: Introduction to Statistics, https://courses.lumenlearning.com/introstats1/chapter/introduction-confidence-intervals/
Langmann, K., How to Calculate a Confidence Interval in Excel, Spreadsheeto, https://spreadsheeto.com/confidence-interval-excel/
Description of the CONFIDENCE statistical functions in Excel, Microsoft, https://support.microsoft.com/en-us/office/description-of-the-confidence-statistical-functions-in-excel-97f5bf0e-5d56-4f8e-8345-2ec1dada8cd5
Carlberg, C., 2011, Statistical Analysis with Excel 2010: Using Excel with the Normal Distribution, informIT, https://www.informit.com/articles/article.aspx?p=1717265&seqNum=3
El-Amir, H., 2019, Titanic (Step 2): Cleaning and Preprocessing, Data is Utopia, https://dataisutopia.com/blog/preprocessing-titanic-dataset/
Srivastava, A., 2019, Cleaning the Titanic Dataset [Day 1- #30daysofML], Analytics Vidhya, https://medium.com/analytics-vidhya/cleaning-the-titanic-dataset-day-1-30daysofml-5cc19294176b
Sisense, Data Cleaning, https://www.sisense.com/glossary/data-cleaning/
Lewandowski, P., 2018, What is data cleaning and why is it important?, sunscrapers, https://sunscrapers.com/blog/why-is-clean-data-so-important-for-analytics-and-business-intelligence/
Microsoft, 2020, List Functions, https://docs.microsoft.com/en-us/powerquery-m/list-functions
Bassett, E.E., Bremmer, J.M., Jolliffe, I.T., et al., 1986, Statistics: Problems and Solutions, 2nd edition, London, pp. 131-134
