Lesson 5 Notes
Diploma in Data Analysis
Confidence in your Sample
Lesson 5: Summary Notes
DATA ANALYSIS
Contents
Lesson 5 objectives
Introduction
Data management
References
Lesson Objectives
• Sample and population
• Confidence interval
• Data management
Lesson Introduction
By the end of this lesson, you will know the difference between a sample and a population and
why samples are so important to us. Along with sampling, you will be introduced to confidence
intervals. We will end the lesson with an interactive data management exercise on our Titanic
dataset, where I will show you a few more tips and tricks to look out for.
What happens when we want to find information about the people in a country quickly? This task would be difficult
with such an extremely large group. Instead, we could conduct a smaller study based on a small portion of the
population. This portion, chosen to be representative of the larger group, is known as a sample. We can use this
sample to make inferences about the general population of the country.
Population defined
A population can be defined as a large set, collection, or group of items that all have something in common. An
example of a population could be all the students enrolled at Shaw Academy: the group contains every student, and
what they have in common is that they are enrolled at Shaw Academy.
Before we begin a study, it is important to know what our population is. We need to define the population to meet
the needs of our study. For example, if we want to conduct a study on birth weights between two given years, we
need to define the population as the birth weights of live-born infants in the United States between years x and y.
But once again, calculating the mean and variance for this population would be difficult for such a large group.
Sample defined
• A sample can be defined as a subset of the population that is selected to be representative of the larger
population. It is a set of data collected or selected from a population through a specific procedure.
• Using the Shaw Academy example, a sample could be all students under the age of 25.
It is rarely practical to collect data on the entire population. Therefore, we take a sample from that population to
deduce information about the larger population. Collecting data from a smaller group is less time-consuming and
much cheaper.
Sample size
It is important to consider the size of your sample relative to your population. For example, it is not sensible to
sample only 3 people when studying the entire population of a country.
If your sample is too small to represent your population, the confidence intervals might be very wide, and we
increase the risk of errors in statistical hypothesis tests.
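The link between sample size and interval width can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses a normal approximation (which is rough for very small samples), a hypothetical standard deviation of 15 units, and a helper name, `ci_half_width`, that is ours rather than part of the lesson.

```python
import math
from statistics import NormalDist

def ci_half_width(sd, n, confidence=0.95):
    """Half-width of a normal-approximation confidence interval for a mean."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    return z * sd / math.sqrt(n)

# Hypothetical standard deviation of 15 units, chosen only for illustration.
for n in (3, 30, 300, 3000):
    print(f"n = {n:>4}: mean +/- {ci_half_width(15, n):.2f}")
```

Each tenfold increase in sample size shrinks the margin by a factor of about 3.2 (the square root of 10), which is why very small samples give very wide intervals.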
Methods of sampling
The process of collecting information from a sample is called sampling. We use sampling techniques because we
want to choose a sample that is a good representation of the population.
Random sampling
• Random sampling, also known as probability sampling, is when a sample of size n is chosen in such a way that
every item in the population has an equal chance of being selected to be in the sample.
• A good example would be drawing names randomly from a hat.
• This is the most common way of sampling.
• With simple random sampling, we can use statistical methods to analyse the sample results.
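As a minimal sketch of simple random sampling, the "names from a hat" idea can be mimicked with Python's standard `random` module. The population of 1,000 student IDs and the sample size of 50 are hypothetical values chosen for illustration.

```python
import random

# Hypothetical population: 1,000 student IDs (illustrative only).
population = [f"student_{i:04d}" for i in range(1, 1001)]

rng = random.Random(42)                 # fixed seed so the draw is repeatable
sample = rng.sample(population, k=50)   # every student equally likely, no repeats

print(sample[:5])
```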
Systematic sampling
• Systematic sampling is when the population is ordered, and items are chosen at regular intervals such as every
5th item in the population.
• The starting point, however, is randomly generated.
• An example of this could be choosing every 10th name in the class starting from the 23rd name in the list.
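Systematic sampling can be sketched as follows. This is one common variant, assuming the random starting point is drawn from the first interval of the list; the function name `systematic_sample` and the 100-name class list are hypothetical.

```python
import random

def systematic_sample(items, step, rng=random):
    """Pick a random start within the first `step` items, then every `step`-th item."""
    start = rng.randrange(step)
    return items[start::step]

names = [f"name_{i}" for i in range(1, 101)]   # hypothetical ordered class list
chosen = systematic_sample(names, step=10, rng=random.Random(7))
print(chosen)
```

With 100 names and a step of 10, the sample always contains 10 names spaced evenly through the list; only the starting position changes from run to run.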
Stratified sampling
• Stratified sampling is when the population is divided into subgroups, also referred to as strata, based on certain
common characteristics, like gender, age, or income bracket.
• For example, if the class has 1500 female students but only 300 male students and we want the sample to be an
accurate reflection of the population, we could take a random sample of 10% from each group, selecting 150
female students and 30 male students to be part of our sample.
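The class example can be sketched in Python as a proportional stratified sample. The group sizes come from the example; the student IDs, the seed, and the helper name `stratified_sample` are assumptions made for illustration.

```python
import random

def stratified_sample(strata, fraction, rng):
    """Draw the same fraction at random from every stratum."""
    return {
        name: rng.sample(members, k=round(len(members) * fraction))
        for name, members in strata.items()
    }

# Group sizes from the class example; the IDs themselves are made up.
strata = {
    "female": [f"F{i}" for i in range(1500)],
    "male": [f"M{i}" for i in range(300)],
}
sample = stratified_sample(strata, fraction=0.10, rng=random.Random(1))
print({name: len(members) for name, members in sample.items()})
# {'female': 150, 'male': 30}
```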
Cluster sampling
A confidence interval is a range of values, calculated from a sample, that is expected to contain a population
parameter (like the mean) in a certain proportion of repeated samples. Confidence intervals are used to measure the
uncertainty or certainty of a sampling method.
The most common confidence levels are 95% and 99%. For instance, we might say we are 95% confident that the mean
lies between two calculated values. If we sample the same population many times and calculate a point estimate and
an interval on each occasion, the intervals would bracket the true population parameter in approximately 95% of the
cases.
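The repeated-sampling idea can be checked with a small simulation. This is a sketch under assumptions, not part of the lesson: the population is taken to be normal with a hypothetical mean of 70 and standard deviation of 10, each sample has 100 observations, and the function name `coverage` is ours.

```python
import random
from statistics import NormalDist, mean, stdev

def coverage(true_mean, true_sd, n, trials, confidence=0.95, seed=0):
    """Fraction of simulated intervals that contain the true mean."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, true_sd) for _ in range(n)]
        centre = mean(sample)
        margin = z * stdev(sample) / n ** 0.5
        if centre - margin <= true_mean <= centre + margin:
            hits += 1
    return hits / trials

print(coverage(true_mean=70, true_sd=10, n=100, trials=2000))
```

The printed fraction should land close to 0.95: each individual interval either contains the true mean or it does not, but the procedure succeeds about 95% of the time.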
A confidence interval, therefore, is a range of values that is likely to contain the unknown population parameter.
The interval tells us how confident we can be that the population parameter falls within a specific range.
You might have heard the phrase “we are 95% certain that group A falls within these ranges”. This statement is just
an everyday way of saying that the population parameter of interest is contained within the 95% confidence interval.
Let’s use an example to illustrate how we interpret a 95% confidence interval on a normal distribution.
Suppose we have a flock of female sheep happily grazing in a field in France. We want to know what the mean weight
of the sheep is. We could take a random sample of 100 sheep from the population and establish a mean weight of
70kg. The mean of 70kg is a point estimate of the population mean (the mean of the entire flock of sheep).
We need to know how far this point estimate from the sample might be from the true mean of the population.
Therefore, we establish a 95% confidence interval using the sample mean and standard deviation, assuming a normal
distribution. Suppose we find our interval to be 60-80kg. These lower and upper bounds come from a procedure that
captures the true mean 95% of the time.
This means that if we took 100 random samples from our population of sheep and built a 95% confidence interval from
each one, about 95 of those intervals should contain the true mean.
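The calculation behind such an interval can be sketched as below, using the normal approximation. The sample mean (70kg) and sample size (100) come from the sheep example. The lesson does not state the standard deviation, so the value of 51kg is a hypothetical figure chosen only so that the result roughly matches the 60-80kg interval. The function name `mean_confidence_interval` is also ours.

```python
import math
from statistics import NormalDist

def mean_confidence_interval(sample_mean, sample_sd, n, confidence=0.95):
    """Normal-approximation confidence interval for a population mean."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * sample_sd / math.sqrt(n)
    return sample_mean - margin, sample_mean + margin

# Mean and sample size from the sheep example; the SD of 51 is an assumption.
low, high = mean_confidence_interval(sample_mean=70, sample_sd=51, n=100)
print(f"95% CI: {low:.1f} to {high:.1f} kg")   # 95% CI: 60.0 to 80.0 kg
```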
Suppose we now want to be 99% confident that our interval captures the true mean. Greater confidence requires a
wider interval, so we move from a 95% to a 99% confidence interval.
To do so, the interval must broaden so that a larger share of the intervals built this way contain the true mean.
For instance, where our previous 95% confidence interval was 60-80kg, the 99% interval might expand to roughly
57-83kg (for argument’s sake), staying centred on the sample mean of 70kg. Now, in about 99 out of 100 repeated
samples, the interval would contain the true population mean.
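How much the interval widens can be sketched with the standard normal multipliers for each confidence level. The standard error of 5.1kg is a hypothetical value, chosen so that the 95% margin is roughly 10kg around a mean of 70kg; `z_multiplier` is an illustrative helper name.

```python
from statistics import NormalDist

def z_multiplier(confidence):
    """Two-sided critical value of the standard normal distribution."""
    return NormalDist().inv_cdf(0.5 + confidence / 2)

standard_error = 5.1   # hypothetical, giving a 95% margin of roughly 10kg
for confidence in (0.95, 0.99):
    margin = z_multiplier(confidence) * standard_error
    print(f"{confidence:.0%}: 70 +/- {margin:.1f} kg")
```

The multiplier grows from about 1.96 at 95% to about 2.58 at 99%, so the interval widens symmetrically around the same sample mean.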
Confidence intervals are often misinterpreted. A confidence interval is not the percentage of data from a given
sample that falls within the interval. When we draw a sample from our flock of sheep and obtain a 95% confidence
interval of 60-80kg, we cannot say that 95% of the weights in that sample fall between the lower and upper bounds of
the interval. To describe where 95% of the individual weights fall, we would instead use the sample mean and
standard deviation directly and need to be able to treat the distribution of weights as normal.
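The difference between an interval for the mean and the spread of the data can be seen in a short simulation. The weights below are made up: 100 values drawn from a normal distribution with a hypothetical mean of 70kg and standard deviation of 10kg.

```python
import random
from statistics import NormalDist, mean, stdev

rng = random.Random(3)
# Hypothetical sample of 100 sheep weights (mean near 70kg, SD near 10kg).
weights = [rng.gauss(70, 10) for _ in range(100)]

z = NormalDist().inv_cdf(0.975)
centre = mean(weights)
margin = z * stdev(weights) / len(weights) ** 0.5
low, high = centre - margin, centre + margin

# Share of individual sheep whose weight lies inside the interval for the mean.
inside = sum(low <= w <= high for w in weights) / len(weights)
print(f"95% CI for the mean: {low:.1f} to {high:.1f} kg")
print(f"Share of individual weights inside it: {inside:.0%}")
```

Only a small share of the individual weights falls inside the interval, because the interval describes uncertainty about the mean, not the spread of the sheep.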
Data management
Data management is important because it optimizes the way that we work and helps ensure that the data is accurate,
which in turn can increase revenue and efficiency in the working environment.
Data cleaning involves removing information that is not needed, updating information that is incomplete or
incorrect, and checking for duplicate variables, to name but a few of the steps.
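Two of these cleaning steps, removing unneeded information and dropping duplicates, can be sketched in plain Python. This is a tool-neutral illustration rather than the procedure used in the lesson: the rows are made-up records in the style of the Titanic dataset, and `clean` is a hypothetical helper.

```python
# Hypothetical rows in the style of the Titanic dataset (values are made up).
rows = [
    {"PassengerId": 1, "Name": "Passenger A", "Age": "22", "Ticket": "T1"},
    {"PassengerId": 2, "Name": "Passenger B", "Age": None, "Ticket": "T2"},
    {"PassengerId": 1, "Name": "Passenger A", "Age": "22", "Ticket": "T1"},  # duplicate
]

def clean(rows, drop_columns=("Ticket",)):
    """Remove unneeded columns and exact duplicate rows, keeping first occurrences."""
    seen, cleaned = set(), []
    for row in rows:
        kept = {k: v for k, v in row.items() if k not in drop_columns}
        key = tuple(sorted(kept.items()))   # hashable fingerprint of the row
        if key not in seen:
            seen.add(key)
            cleaned.append(kept)
    return cleaned

print(clean(rows))
```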
Transform a variable
2. The data type can be changed from, for example, “Whole Number” to “Decimal Number”.
3. Null values can be removed from a variable by selecting the drop-down and deselecting “(null)”.
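The same two transformations can be sketched in Python for comparison. The column of ages is hypothetical, and this is an illustration of the idea rather than the tool used in the lesson.

```python
# Hypothetical "Age" column stored as whole numbers, with some missing entries.
ages = [22, None, 38, 26, None, 35]

non_null = [a for a in ages if a is not None]   # drop the null values
as_decimal = [float(a) for a in non_null]       # whole number -> decimal number

print(as_decimal)   # [22.0, 38.0, 26.0, 35.0]
```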
Resources
Glen, S., 2013, What is a Population in Statistics?, Statistics How To, https://www.statisticshowto.com/what-is-a-population/
365 Data Science, Statistics - Population vs sample, https://365datascience.com/explainer-video/population-vs-sample/
Engineering Statistics Handbook, 2013, 7.1.4. What are confidence intervals?, https://www.itl.nist.gov/div898/handbook/prc/section1/prc14.htm
Kenton, W., 2020, Confidence Intervals, Investopedia, https://www.investopedia.com/terms/c/confidenceinterval.asp#:~:text=A%20confidence%20interval%2C%20in%20statistics,certainty%20in%20a%20sampling%20method.
Yale University: Department of Statistics, 1997, Confidence Intervals, http://www.stat.yale.edu/Courses/1997-98/101/confint.htm
Introduction to Confidence Intervals, Lumen course: Introduction to Statistics, https://courses.lumenlearning.com/introstats1/chapter/introduction-confidence-intervals/
Langmann, K., How to Calculate a Confidence Interval in Excel, Spreadsheeto, https://spreadsheeto.com/confidence-interval-excel/
Description of the CONFIDENCE statistical functions in Excel, Microsoft, https://support.microsoft.com/en-us/office/description-of-the-confidence-statistical-functions-in-excel-97f5bf0e-5d56-4f8e-8345-2ec1dada8cd5
Carlberg, C., 2011, Statistical Analysis with Excel 2010: Using Excel with the Normal Distribution, informIT, https://www.informit.com/articles/article.aspx?p=1717265&seqNum=3
El-Amir, H., 2019, Titanic (Step 2): Cleaning and Preprocessing, Data is Utopia, https://dataisutopia.com/blog/preprocessing-titanic-dataset/
Srivastava, A., 2019, Cleaning the Titanic Dataset [Day 1- #30daysofML], Analytics Vidhya, https://medium.com/analytics-vidhya/cleaning-the-titanic-dataset-day-1-30daysofml-5cc19294176b
Sisense, Data Cleaning, https://www.sisense.com/glossary/data-cleaning/
Lewandowski, P., 2018, What is data cleaning and why is it important?, sunscrapers, https://sunscrapers.com/blog/why-is-clean-data-so-important-for-analytics-and-business-intelligence/
Microsoft, 2020, List Functions, https://docs.microsoft.com/en-us/powerquery-m/list-functions
Bassett, E.E., Bremmer, J.M., Jolliffe, I.T., et al., 1986, Statistics: Problems and Solutions, 2nd edition, London, pp. 131-134