Lesson 5 Notes

This document provides an overview of key concepts in data analysis, focusing on the differences between population and sample, confidence intervals, and data management techniques. It explains the importance of sampling methods and how confidence intervals can be used to estimate population parameters. Additionally, it highlights the significance of data cleaning in ensuring accurate analysis results.

Uploaded by

chandantavane99

Diploma in Data Analysis

Confidence in your Sample
Lesson 5: Summary Notes

DATA ANALYSIS

Contents

Lesson 5 objectives
Introduction
Population and sample
Confidence interval and limits
Data management
References


Lesson Objectives
• Sample and population
• Confidence interval
• Data management

Lesson Introduction
By the end of this lesson, you will know the
differences between a sample and a population and
why samples are so important to us. Along with
sampling, you will be introduced to confidence
intervals. We will end the lesson with an
interactive data management exercise on our
Titanic dataset, where I will show you a few more
tips and tricks to look out for.

Did you know?

Career opportunities in analytics have been booming over the last few years. A while ago, McKinsey
estimated that the United States alone faces a shortage of 140 000 to 190 000 employees with analytics
expertise. Almost every industry you can think of that uses data will require data analysts to analyse it.


Population and sample

Imagine we want to gather information about people living in a country. Many governments conduct a census about
every 10 years, which collects information from every person living in that country. As you can imagine, this can be
very time and resource consuming.

What happens when we want to find information about people in the country quickly? This task would be difficult
with an extremely large group. Instead, we could conduct a smaller study based on a small proportion of the
population. If this proportion is chosen to be representative of the larger group, it is known as a sample. We can use
this sample to make inferences about the general population of the country.

Population defined

A population can be defined as a large set, collection or group of items that all have something in common. An
example of a population could be all the students enrolled at Shaw Academy. This group contains every student, and
what the students have in common is that they are enrolled at Shaw Academy.

Before we begin a study, it is important to know what our population is. We need to define the population to meet
the needs of our study. For example, if we want to conduct a study on birth weights over a given period, we could
define the population as the birth weights of live-born infants in the United States between years x and y. But once
again, measuring every member to calculate the mean and variance for this population would be difficult for such a
large group.

Sample defined

• A sample can be defined as a subset of the population that is selected to be representative of the larger
population. It is a set of data that is collected or selected from a population or larger group through a specific
procedure.
• Using the Shaw Academy example, a sample could be all students under the age of 25. Note that such a sample would
only be representative of the whole student body if age is unrelated to what we are studying.

It is rarely practical to collect data from the entire population, so we take a sample from that population to deduce
information about the larger population. Collecting a smaller group of data is less time consuming and much
cheaper.

Sample size

It is important to consider the size of your sample given your population. For example, it is not sensible to sample 3
people when considering the entire population of a country.

If you choose a sample that is too small to represent your population, the confidence intervals might be very
wide, and we increase our risk of errors in statistical hypothesis tests.

It is not always possible to take a bigger sample due to resource constraints.

It is important to remember that larger sample sizes will generally be more accurate in estimating unknown
parameters of the population. We select a sample size depending on the precision we want.
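
To make the link between sample size and precision concrete, here is a minimal Python sketch. It assumes the usual normal-approximation margin of error for a mean (1.96 standard errors for 95% confidence) and a hypothetical standard deviation of 15; neither value comes from these notes.

```python
import math

def margin_of_error(std_dev, n, z=1.96):
    """Half-width of an approximate 95% confidence interval for a mean."""
    return z * std_dev / math.sqrt(n)

# Hypothetical standard deviation, chosen only for illustration.
std_dev = 15.0

# The margin shrinks as the sample size grows.
for n in (3, 30, 300, 3000):
    print(f"n = {n:>4}: margin of error = +/- {margin_of_error(std_dev, n):.2f}")
```

The margin shrinks with the square root of n, so a sample ten times larger narrows the interval by a factor of roughly three. For very small samples a t-based multiplier would normally replace 1.96, which makes the interval wider still.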


Methods of sampling

The process of collecting information from a sample is called sampling. We use sampling techniques because we
want to choose a sample that is a good representation of the population.

A few sampling techniques include random sampling, systematic sampling, stratified sampling, and cluster sampling.

Random sampling

• Simple random sampling, a type of probability sampling, is when a sample of size n is chosen in such a way that
every item in the population has an equal chance of being selected to be in the sample.
• A good example would be drawing names randomly from a hat.
• This is the most common way of sampling.
• With simple random sampling, we can use statistical methods to analyse the sample results.
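
As a rough illustration of drawing names from a hat, the sketch below uses Python's standard `random` module. The population of 1,000 numbered student IDs and the sample size of 50 are hypothetical values chosen for the example.

```python
import random

# Hypothetical population: 1,000 numbered student IDs.
population = list(range(1, 1001))

rng = random.Random(42)  # fixed seed so the draw is repeatable

# Draw 50 IDs without replacement; every ID is equally likely to be picked.
sample = rng.sample(population, k=50)

print(sorted(sample)[:10])
```

Sampling without replacement means no student can appear twice, which matches the names-in-a-hat picture.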

Systematic sampling

• Systematic sampling is when the population is ordered, and items are chosen at regular intervals, such as every
5th item in the population.
• The starting point, however, is randomly generated.
• An example of this could be choosing every 10th name on the class list, starting from a randomly chosen position
such as the 23rd name.
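
The sketch below shows one way systematic sampling could look in Python. The class list of 100 names and the helper `systematic_sample` are hypothetical. As a design assumption, the random start is drawn from the first interval so that the sample covers the whole list.

```python
import random

def systematic_sample(items, step, rng):
    """Pick every `step`-th item, starting at a random position within the first `step` items."""
    start = rng.randrange(step)
    return items[start::step]

# Hypothetical ordered class list of 100 names.
names = [f"student_{i:03d}" for i in range(1, 101)]

rng = random.Random(7)
sample = systematic_sample(names, step=10, rng=rng)

print(sample)
```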

Stratified sampling

• Stratified sampling is when the population is divided into subgroups, also referred to as strata, based on certain
common characteristics, like gender, age, or income bracket.
• For example, if the class has 1500 female students attending but only 300 male students and we want to ensure
the sample is an accurate reflection of the population, we could take a random sample of the same proportion from
each group, for instance 10%, by selecting 150 female students and 30 male students to be a part of our sample.
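
The class example above can be sketched in Python as follows. The group sizes come from the example. The 10% fraction, the generated student labels, and the helper `stratified_sample` are assumptions made for illustration.

```python
import random

def stratified_sample(strata, fraction, rng):
    """Draw the same fraction at random from every stratum."""
    sample = {}
    for name, members in strata.items():
        k = round(len(members) * fraction)
        sample[name] = rng.sample(members, k)
    return sample

# Strata sizes follow the class example; the labels are made up.
strata = {
    "female": [f"F{i}" for i in range(1500)],
    "male": [f"M{i}" for i in range(300)],
}

rng = random.Random(1)
sample = stratified_sample(strata, fraction=0.10, rng=rng)

print({name: len(members) for name, members in sample.items()})
```

Taking the same fraction from each stratum keeps the 5:1 ratio of the population in the sample.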

Cluster sampling

• Cluster sampling is when the population is divided into clusters or subgroups.
• The clusters are all similar to one another in their characteristics.
• Clusters are then randomly selected, and all the items in each chosen cluster are included in the sample.
• This is a good technique to use for large populations that are extremely diverse.
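
A minimal Python sketch of cluster sampling is shown below. The eight classes of 20 students each are hypothetical, as is the choice to select two clusters.

```python
import random

# Hypothetical clusters: eight classes, each with 20 students.
clusters = {
    f"class_{c}": [f"class_{c}_student_{s}" for s in range(1, 21)]
    for c in "ABCDEFGH"
}

rng = random.Random(3)

# Randomly pick whole clusters rather than individual students.
chosen = rng.sample(sorted(clusters), k=2)

# Every item in a chosen cluster joins the sample.
sample = [student for name in chosen for student in clusters[name]]

print(chosen, len(sample))
```

The randomness applies to the clusters, not to the individuals, which is the main difference from stratified sampling.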


Confidence intervals and limits

Suppose you wanted to know the average number of chips you get when you order medium fries at McDonald’s. If you
ordered a few portions, tallied the total number of chips and divided it by the number of portions ordered, you would
have obtained what is called a point estimate of the true mean. We can use sample data to draw conclusions about
an unknown population. This is called inferential statistics, where we use sample data to make an estimate of a
population parameter. We know the estimate isn’t going to be the exact value, but we can get close to it. From point
estimates we can construct interval estimates known as confidence intervals. A confidence interval
provides us with a range of values which is likely to contain the population parameter that we are interested in.
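
The step from a point estimate to an interval estimate can be sketched in Python. The chip counts below are made-up numbers, and the interval uses the common normal approximation (1.96 standard errors for 95% confidence), which is an assumption rather than a method stated in these notes.

```python
import math
import statistics

# Hypothetical counts of chips in ten medium portions (made-up numbers).
chip_counts = [52, 48, 55, 50, 47, 53, 49, 51, 54, 46]

n = len(chip_counts)
point_estimate = statistics.mean(chip_counts)  # sample mean
std_dev = statistics.stdev(chip_counts)        # sample standard deviation
std_error = std_dev / math.sqrt(n)             # standard error of the mean

# 1.96 is the normal-approximation multiplier for 95% confidence.
lower = point_estimate - 1.96 * std_error
upper = point_estimate + 1.96 * std_error

print(f"point estimate: {point_estimate:.1f}")
print(f"approximate 95% interval: {lower:.1f} to {upper:.1f}")
```

With only ten portions, a t-based multiplier would usually be preferred over 1.96 and would give a slightly wider interval.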

Confidence intervals defined

A confidence interval is a range of values, calculated from sample data, that is likely to contain a population
parameter (like the mean). The confidence level states how often intervals built in this way would contain the
parameter, so it measures the uncertainty or certainty of a sampling method.
The most common confidence levels are 95% and 99%. For instance, we may say we are 95% confident that the mean lies
between the determined values. If we sample the same population many times and calculate a point estimate and an
interval on each occasion, the intervals would bracket the true population parameter in approximately 95% of the
cases.

A confidence interval, therefore, is a range of values that is likely to contain the unknown population parameter. The
interval tells us how certain we can be that the population parameter falls within a specific range.
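
The repeated-sampling idea can be checked with a small simulation. The sketch below assumes a normally distributed population with a hypothetical true mean of 70 and standard deviation of 10, and counts how often a normal-approximation 95% interval contains the true mean.

```python
import math
import random
import statistics

def coverage(true_mean, true_sd, n, trials, rng, z=1.96):
    """Fraction of simulated intervals that contain the true mean."""
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, true_sd) for _ in range(n)]
        mean = statistics.mean(sample)
        margin = z * statistics.stdev(sample) / math.sqrt(n)
        if mean - margin <= true_mean <= mean + margin:
            hits += 1
    return hits / trials

rng = random.Random(0)
rate = coverage(true_mean=70, true_sd=10, n=100, trials=2000, rng=rng)

print(f"intervals containing the true mean: {rate:.1%}")
```

The printed rate should land close to 95%, although it will vary a little with the seed and the number of trials.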

Did you know: In the vernacular

You might have heard the phrase “we are 95% certain that group A falls within these ranges”. In the vernacular, this
statement is just another way of saying that the population parameter of interest is contained within the 95%
confidence interval.

Interpreting confidence intervals

Let’s use an example to illustrate how we interpret a 95% confidence interval on a normal distribution.
Suppose we have a flock of female sheep happily grazing in a field in France. We want to know what the mean weight
of the sheep is. We could take a random sample of 100 sheep from the population and establish a mean weight of
70kg. The mean of 70kg is a point estimate of the population mean (the mean weight of the entire flock).

We need to know how far this point estimate from the sample might be from the true mean of the population. Therefore,
we establish a 95% confidence interval using the sample mean and standard deviation and assuming a normal
distribution. Suppose we find our interval to be 60 - 80 kg. Its lower and upper bounds come from a procedure that
captures the true mean 95% of the time.

This means that if we took 100 random samples from our population of sheep and built an interval from each one, about
95 of those intervals should contain the true mean.

Now suppose we want to be able to say that we are 99% certain, meaning we want greater confidence and move from a
95% to a 99% confidence interval.


To do so, our confidence interval will broaden so that more of the intervals built from repeated samples capture the
true mean. For instance, where our previous confidence interval was 60 - 80kg, the confidence interval will now
expand to 60 - 90kg (for argument’s sake). Now, for about 99 out of 100 samples, the interval built from the sample
will contain the true mean.
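
To see how much an interval widens when moving from 95% to 99% confidence, here is a short Python sketch using the normal approximation. The sample mean (70kg) and sample size (100) come from the sheep example. The standard deviation of 12kg is a hypothetical value because the notes do not state one, so the printed bounds will not match the illustrative figures in the text.

```python
import math

def confidence_interval(mean, std_dev, n, z):
    """Normal-approximation interval for a mean from summary statistics."""
    margin = z * std_dev / math.sqrt(n)
    return mean - margin, mean + margin

# Mean and sample size follow the sheep example; std_dev is an assumed value.
mean, std_dev, n = 70.0, 12.0, 100

ci_95 = confidence_interval(mean, std_dev, n, z=1.960)  # 95% multiplier
ci_99 = confidence_interval(mean, std_dev, n, z=2.576)  # 99% multiplier

print(f"95%: {ci_95[0]:.2f} to {ci_95[1]:.2f} kg")
print(f"99%: {ci_99[0]:.2f} to {ci_99[1]:.2f} kg")
```

Because the interval is centred on the sample mean, a higher confidence level pushes both bounds outward by the same amount.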

Misconceptions/misunderstandings about confidence intervals

Confidence intervals are often misinterpreted. A confidence interval is not the percentage of data from a given
sample that falls within the interval. When we draw a sample from our flock of sheep and obtain a 95%
confidence interval of 60 - 80kg, we cannot say that 95% of the data from the random sample falls between the upper
and lower bounds of the interval. To describe where 95% of the data falls, we would need to use the sample mean and
standard deviation themselves and be able to identify the sample distribution as normal.
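
The difference between the two ideas can be shown with simulated data. In the sketch below, the 100 sheep weights are generated from a hypothetical normal distribution (mean 70kg, standard deviation 10kg), and we count how many individual weights fall inside the 95% confidence interval for the mean versus inside the mean plus or minus 1.96 standard deviations.

```python
import math
import random
import statistics

rng = random.Random(5)

# Hypothetical sheep weights in kg, simulated for illustration only.
weights = [rng.gauss(70, 10) for _ in range(100)]

mean = statistics.mean(weights)
std_dev = statistics.stdev(weights)

# Half-width of the 95% confidence interval for the MEAN.
margin = 1.96 * std_dev / math.sqrt(len(weights))

inside_ci = sum(mean - margin <= w <= mean + margin for w in weights)
inside_spread = sum(
    mean - 1.96 * std_dev <= w <= mean + 1.96 * std_dev for w in weights
)

print(f"weights inside the 95% CI for the mean: {inside_ci} of {len(weights)}")
print(f"weights inside mean +/- 1.96 SD:        {inside_spread} of {len(weights)}")
```

Only a small share of the individual weights falls inside the confidence interval for the mean, while the range based on the standard deviation covers most of them.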


Data management

Did you know?

Did you know that analytics professionals can spend up to 60% of their time on data management? Data cleaning
is a foundational step in the data analysis process. The step is important to ensure that the results we generate are
accurate. When data is drawn from many different sources, it often carries mistakes.

Data management

Data management is important because it optimizes the way that we work and ensures that the data is accurate, which
in turn can increase revenue and efficiency in the working environment.

Data cleaning involves removing information that is not needed, updating information that is incomplete or
incorrect, and checking for duplicate variables, to name but a few of the steps.

Transform a variable

1. Decimal points in values that were read in can be replaced with commas.
2. Data type can be changed from, for example, “Whole Number” to “Decimal Number”.
3. Null values can be removed from a variable by selecting the drop-down and deselecting “(null)”.
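
The steps above are described for a point-and-click tool. As a rough analogue, the Python sketch below applies the same three ideas to a hypothetical list of fares stored as text. Note that the separator swap here goes from comma to point, because Python's `float()` expects a decimal point. The data and the helper `clean_fares` are assumptions made for illustration.

```python
# Hypothetical raw fares read in as text, with decimal commas and missing values.
raw_fares = ["7,25", "71,2833", None, "8,05", "", "53,1"]

def clean_fares(values):
    """Drop nulls, normalise the decimal separator, and convert text to numbers."""
    cleaned = []
    for value in values:
        if value is None or value.strip() == "":
            continue  # step 3: remove null values
        normalised = value.replace(",", ".")  # step 1: swap the decimal separator
        cleaned.append(float(normalised))     # step 2: change the data type
    return cleaned

fares = clean_fares(raw_fares)
print(fares)
```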


Resources
Glen, S., 2013, What is a Population in Statistics?, Statistics How To, https://www.statisticshowto.com/what-is-a-population/
365 Data Science, Statistics - Population vs sample, https://365datascience.com/explainer-video/population-vs-sample/
Engineering Statistics Handbook, 2013, 7.1.4. What are confidence intervals?, https://www.itl.nist.gov/div898/handbook/prc/section1/prc14.htm
Kenton, W., 2020, Confidence Intervals, Investopedia, https://www.investopedia.com/terms/c/confidenceinterval.asp#:~:text=A%20confidence%20interval%2C%20in%20statistics,certainty%20in%20a%20sampling%20method.
Yale University: Department of Statistics, 1997, Confidence Intervals, http://www.stat.yale.edu/Courses/1997-98/101/confint.htm
Introduction to Confidence Intervals, Lumen course: Introduction to Statistics, https://courses.lumenlearning.com/introstats1/chapter/introduction-confidence-intervals/
Langmann, K., How to Calculate a Confidence Interval in Excel, Spreadsheeto, https://spreadsheeto.com/confidence-interval-excel/
Description of the CONFIDENCE statistical functions in Excel, Microsoft, https://support.microsoft.com/en-us/office/description-of-the-confidence-statistical-functions-in-excel-97f5bf0e-5d56-4f8e-8345-2ec1dada8cd5
Carlberg, C., 2011, Statistical Analysis with Excel 2010: Using Excel with the Normal Distribution, informIT, https://www.informit.com/articles/article.aspx?p=1717265&seqNum=3
El-Amir, H., 2019, Titanic (Step 2): Cleaning and Preprocessing, Data is Utopia, https://dataisutopia.com/blog/preprocessing-titanic-dataset/
Srivastava, A., 2019, Cleaning the Titanic Dataset [Day 1- #30daysofML], Analytics Vidhya, https://medium.com/analytics-vidhya/cleaning-the-titanic-dataset-day-1-30daysofml-5cc19294176b
Sisense, Data Cleaning, https://www.sisense.com/glossary/data-cleaning/
Lewandowski, P., 2018, What is data cleaning and why is it important?, sunscrapers, https://sunscrapers.com/blog/why-is-clean-data-so-important-for-analytics-and-business-intelligence/
Microsoft, 2020, List Functions, https://docs.microsoft.com/en-us/powerquery-m/list-functions
Bassett, E.E., Bremmer, J.M., Jolliffe, I.T., et al., 1986, Statistics: Problems and Solutions, 2nd edition, London, pp. 131-134
