Bootstrapping and Sampling in R

Introduction to bootstrapping
SAMPLING IN R

Richie Cotton
Data Evangelist at DataCamp
With or without
Sampling without replacement
Sampling with replacement ("resampling")
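
As a minimal sketch of the two modes, base R's sample() can draw either way. The vector population below is a made-up stand-in, not data from the course, and the seed is arbitrary.

```r
# Hypothetical population of 10 labelled items (an assumption for illustration)
population <- 1:10

set.seed(123)  # arbitrary seed so the draws are reproducible

# Without replacement: each item can appear at most once
without_repl <- sample(population, size = 10, replace = FALSE)

# With replacement ("resampling"): items can repeat, so others may be left out
with_repl <- sample(population, size = 10, replace = TRUE)

# A full-size draw without replacement is just a reordering of the population
anyDuplicated(without_repl) == 0

# A draw with replacement can never contain more distinct values than the population
length(unique(with_repl)) <= length(population)
```

Both checks print TRUE. The second draw will usually contain repeats, which is the behavior that bootstrapping relies on.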

Simple random sampling without replacement
[Figure with two panels: Population, Sample]

Simple random sampling with replacement
[Figure with two panels: Population, Sample]

Why sample with replacement?
Think of the coffee_ratings data as being a sample of a larger population of all coffees.

Think about each coffee in our sample as being representative of many different coffees
that we don't have in our sample, but do exist in the population.

Sampling with replacement is a proxy for including different members of these groups in our
sample.

Coffee data preparation
library(dplyr)   # select(), glimpse(), slice_sample(), %>%
library(tibble)  # rowid_to_column()

coffee_focus <- coffee_ratings %>%
  select(variety, country_of_origin, flavor) %>%
  rowid_to_column()

glimpse(coffee_focus)

Rows: 1,338
Columns: 4
$ rowid <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, ...
$ variety <chr> NA, "Other", "Bourbon", NA, "Other", NA, "Other", N...
$ country_of_origin <chr> "Ethiopia", "Ethiopia", "Guatemala", "Ethiopia", "E...
$ flavor <dbl> 8.83, 8.67, 8.50, 8.58, 8.50, 8.42, 8.50, 8.33, 8.6...

Resampling with slice_sample()
coffee_resamp <- coffee_focus %>%
  slice_sample(prop = 1, replace = TRUE)

# A tibble: 1,338 x 4
   rowid variety country_of_origin flavor
   <int> <chr>   <chr>              <dbl>
 1  1253 Bourbon Guatemala           6.92
 2   186 Caturra Colombia            7.58
 3  1185 Bourbon Guatemala           7.42
 4  1273 NA      Philippines         6.5
 5  1042 Caturra Honduras            7.33
 6   195 Caturra Guatemala           7.75
 7  1219 Typica  Mexico              7
 8   952 Caturra Honduras            7.5
 9    41 Caturra Thailand            8.33
10   460 Caturra Honduras            7.67
# ... with 1,328 more rows

Repeated coffees
coffee_resamp %>%
  count(rowid, sort = TRUE)

# A tibble: 844 x 2
   rowid     n
   <int> <int>
 1   704     5
 2   913     5
 3  1070     5
 4    16     4
 5   180     4
 6   230     4
 7   234     4
 8   342     4
 9   354     4
10   423     4
# ... with 834 more rows

Missing coffees
coffee_resamp %>%
  summarize(
    coffees_included = n_distinct(rowid),
    coffees_not_included = n() - coffees_included
  )

# A tibble: 1 x 2
  coffees_included coffees_not_included
             <int>                <int>
1              844                  494
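
The split of 844 included and 494 not included is close to what probability predicts. In n draws with replacement, a given row is missed with probability (1 - 1/n)^n, which approaches 1/e (about 37%) as n grows. The sketch below is only a worked calculation; it uses the row count of 1,338 from the glimpse() output and nothing else from the data.

```r
# Number of rows in the original sample (from the glimpse() output)
n <- 1338

# Probability that a given row is never picked in n draws with replacement
p_missing <- (1 - 1 / n)^n

# Expected number of distinct rows included in, and missing from, a resample
expected_included <- n * (1 - p_missing)
expected_not_included <- n * p_missing

round(c(included = expected_included, not_included = expected_not_included))
```

The expected counts come out at roughly 846 and 492, so the resample shown on the slide is typical rather than unusual.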

Bootstrapping
The opposite of sampling from a population.

Sampling: going from a population to a smaller sample.

Bootstrapping: building up a theoretical population from your sample.

Bootstrapping use case

Develop understanding of sampling variability using a single sample.

Bootstrapping process
1. Make a resample of the same size as the original sample.
2. Calculate the statistic of interest for this bootstrap sample.
3. Repeat steps 1 and 2 many times.

The resulting statistics are called bootstrap statistics. Taken together, they form a bootstrap
distribution, which shows how much the statistic varies.
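
The three steps work for any statistic, not only the mean. Here is a minimal base R sketch that bootstraps a median. The helper bootstrap_stat() and the simulated scores vector are hypothetical names invented for this illustration; they are not part of the course code.

```r
set.seed(2024)  # arbitrary seed for reproducibility

# Made-up sample of 200 scores (an assumption; not the coffee data)
scores <- rnorm(200, mean = 7.5, sd = 0.35)

# Hypothetical helper: bootstrap any statistic of a numeric vector
bootstrap_stat <- function(x, stat, n_resamples = 1000) {
  # Step 3. Repeat many times
  replicate(n_resamples, {
    # Step 1. Resample with replacement, same size as the original
    resample <- sample(x, size = length(x), replace = TRUE)
    # Step 2. Calculate the statistic of interest
    stat(resample)
  })
}

boot_medians <- bootstrap_stat(scores, median)

# Spread of the bootstrap distribution of the median
sd(boot_medians)
```

Passing the statistic as a function keeps the resampling logic in one place, so swapping median for mean or sd needs no other change.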

Bootstrapping coffee mean flavor
# Step 3. Repeat many times
mean_flavors_1000 <- replicate(
  n = 1000,
  expr = {
    coffee_focus %>%
      # Step 1. Resample
      slice_sample(prop = 1, replace = TRUE) %>%
      # Step 2. Calculate statistic
      summarize(mean_flavor = mean(flavor, na.rm = TRUE)) %>%
      pull(mean_flavor)
  }
)

Bootstrap distribution histogram
library(ggplot2)

bootstrap_distn <- tibble(
  resample_mean = mean_flavors_1000
)

ggplot(bootstrap_distn, aes(resample_mean)) +
  geom_histogram(binwidth = 0.0025)

Let's practice!
Comparing sampling and bootstrap distributions

Coffee focused subset
set.seed(19790801)
coffee_sample <- coffee_ratings %>%
  select(variety, country_of_origin, flavor) %>%
  rowid_to_column() %>%
  slice_sample(n = 500)
glimpse(coffee_sample)

Rows: 500
Columns: 4
$ rowid <int> 10, 278, 458, 622, 131, 385, 1292, 47, 904, 1020, 5...
$ variety <chr> "Other", "Bourbon", NA, "Caturra", "Caturra", "Yell...
$ country_of_origin <chr> "Ethiopia", "Guatemala", "Colombia", "Thailand", "C...
$ flavor <dbl> 8.58, 7.75, 7.75, 7.50, 8.00, 7.83, 7.17, 8.08, 7.3...

The bootstrap of mean coffee flavors
mean_flavors_1000 <- replicate(
  n = 1000,
  expr = coffee_sample %>%
    slice_sample(prop = 1, replace = TRUE) %>%
    summarize(mean_flavor = mean(flavor, na.rm = TRUE)) %>%
    pull(mean_flavor)
)
bootstrap_distn <- tibble(
  resample_mean = mean_flavors_1000
)

Mean flavor bootstrap distribution
ggplot(bootstrap_distn, aes(resample_mean)) +
  geom_histogram(binwidth = 0.0025)

Sample, bootstrap distribution, population means
Sample mean

coffee_sample %>%
  summarize(mean_flavor = mean(flavor)) %>%
  pull(mean_flavor)

7.5163

Estimated population mean

bootstrap_distn %>%
  summarize(mean_mean_flavor = mean(resample_mean)) %>%
  pull(mean_mean_flavor)

7.5167

True population mean

coffee_ratings %>%
  summarize(mean_flavor = mean(flavor)) %>%
  pull(mean_flavor)

7.5260

Interpreting the means
The bootstrap distribution mean is usually almost identical to the sample mean.
It may not be a good estimate of the population mean.

Bootstrapping cannot correct biases due to differences between your sample and the
population.
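
A small simulation can make this point concrete. The sketch below builds a made-up population, deliberately draws a biased sample from it, and bootstraps the mean. All names and numbers are assumptions chosen for illustration, not course data.

```r
set.seed(42)  # arbitrary seed

# Made-up population of 10,000 scores (an assumption for illustration)
population <- rnorm(10000, mean = 7.5, sd = 0.35)

# A deliberately biased sample: 500 draws from the top-scoring half only
top_half <- population[population > median(population)]
biased_sample <- sample(top_half, size = 500)

# Bootstrap the mean of the biased sample
boot_means <- replicate(1000, mean(sample(biased_sample, replace = TRUE)))

c(
  population_mean = mean(population),
  sample_mean     = mean(biased_sample),
  bootstrap_mean  = mean(boot_means)
)
```

The bootstrap mean lands next to the sample mean and stays well above the population mean, because resampling can only reuse the values the biased sample already contains.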

Sample sd vs bootstrap distribution sd
Sample standard deviation

coffee_sample %>%
  summarize(sd_flavor = sd(flavor)) %>%
  pull(sd_flavor)

0.3525

Estimated population standard deviation?

bootstrap_distn %>%
  summarize(sd_mean_flavor = sd(resample_mean)) %>%
  pull(sd_mean_flavor)

0.01572

Sample, bootstrap dist'n, pop'n standard deviations
Sample standard deviation

coffee_sample %>%
  summarize(sd_flavor = sd(flavor)) %>%
  pull(sd_flavor)

0.3525

Estimated population standard deviation

standard_error <- bootstrap_distn %>%
  summarize(sd_mean_flavor = sd(resample_mean)) %>%
  pull(sd_mean_flavor)

standard_error * sqrt(500)

0.3515

True standard deviation

coffee_ratings %>%
  summarize(sd_flavor = sd(flavor)) %>%
  pull(sd_flavor)

0.3414

Standard error is the standard deviation of the statistic of interest.

Standard error times square root of sample size estimates the population standard deviation.

Interpreting the standard errors
Estimated standard error is the standard deviation of the bootstrap distribution for a sample
statistic.

The bootstrap distribution standard error times the square root of the sample size
estimates the standard deviation in the population.
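
The reasoning behind the square-root rule is that the standard error of a mean is approximately the population standard deviation divided by the square root of the sample size, so multiplying back by sqrt(n) recovers the standard deviation. As a quick check of the arithmetic, using only the numbers reported on the earlier slides:

```r
# Values reported on the earlier slides
standard_error <- 0.01572  # sd of the bootstrap distribution of the mean
n <- 500                   # number of rows in coffee_sample

# Standard error of a mean is roughly sd / sqrt(n), so sd is roughly SE * sqrt(n)
estimated_pop_sd <- standard_error * sqrt(n)

round(estimated_pop_sd, 4)
```

This reproduces the 0.3515 shown earlier, which sits close to the sample value of 0.3525 and the true value of 0.3414.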

Let's practice!
Confidence intervals

Confidence intervals
"Values within one standard deviation of the mean" includes a large number of values from
each of these distributions.

We'll define a related concept called a confidence interval.

Predicting the weather
Rapid City, South Dakota in the United States has the least predictable weather.

Your job is to predict the high temperature there tomorrow.

Your weather prediction
point estimate = 47 °F (8.3 °C)
range of plausible high temperature values = 40 to 54 °F (4.4 to 12.8 °C)

You just reported a confidence interval
40 to 54 °F is a confidence interval
Sometimes written as 47 °F (40 °F, 54 °F) or 47 °F [40 °F, 54 °F]

... or, 47 ± 7 °F

7 °F is the margin of error

Bootstrap distribution of mean flavor
ggplot(coffee_boot_distn, aes(resample_mean)) +
  geom_histogram(binwidth = 0.002)

Mean of the resamples
coffee_boot_distn %>%
  summarize(
    mean_resample_mean = mean(resample_mean)
  )

# A tibble: 1 x 1
mean_resample_mean
<dbl>
1 7.5263

Mean plus or minus one standard deviation
coffee_boot_distn %>%
  summarize(
    mean_resample_mean = mean(resample_mean),
    mean_minus_1sd = mean_resample_mean - sd(resample_mean),
    mean_plus_1sd = mean_resample_mean + sd(resample_mean)
  )

# A tibble: 1 x 3
  mean_resample_mean mean_minus_1sd mean_plus_1sd
               <dbl>          <dbl>         <dbl>
1             7.5263         7.5171        7.5355

Quantile method for confidence intervals
coffee_boot_distn %>%
  summarize(
    lower = quantile(resample_mean, 0.025),
    upper = quantile(resample_mean, 0.975)
  )

# A tibble: 1 x 2
lower upper
<dbl> <dbl>
1 7.5087 7.5447

Inverse cumulative distribution function
PDF: The bell curve

CDF: integrate to get area under bell curve

Inv. CDF: flip x and y axes

normal_inv_cdf <- tibble(
  # qnorm() is only defined for probabilities between 0 and 1
  p = seq(0.001, 0.999, 0.001),
  inv_cdf = qnorm(p)
)

ggplot(normal_inv_cdf, aes(p, inv_cdf)) +
  geom_line()

1 See "Introduction to Statistics in R", Ch3, "The Normal Distribution"
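
To see "flip x and y axes" in numbers, the short sketch below runs a few probabilities through qnorm() and then back through pnorm(), the normal CDF. The three probabilities are arbitrary choices for illustration.

```r
# qnorm() is the inverse CDF (quantile function) of the normal distribution;
# pnorm() is the CDF. Applying one after the other returns the input.
p <- c(0.025, 0.5, 0.975)
z <- qnorm(p)

# Quantiles of the standard normal for these probabilities
z

# Round trip: the CDF recovers the original probabilities
all.equal(pnorm(z), p)
```

The outer quantiles are about -1.96 and 1.96, the familiar cutoffs for a 95% interval, and the round trip returns TRUE.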

Standard error method for confidence interval
coffee_boot_distn %>%
  summarize(
    point_estimate = mean(resample_mean),
    std_error = sd(resample_mean),
    lower = qnorm(0.025, point_estimate, std_error),
    upper = qnorm(0.975, point_estimate, std_error)
  )

# A tibble: 1 x 4
point_estimate std_error lower upper
<dbl> <dbl> <dbl> <dbl>
1 7.5263 0.0091815 7.5083 7.5443
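
The quantile method and the standard error method give nearly the same interval here: 7.5087 to 7.5447 versus 7.5083 to 7.5443. As a sketch of why, the base R code below computes both on a simulated sample. The flavor vector is made-up data standing in for the coffee scores, and the seed is arbitrary.

```r
set.seed(1)  # arbitrary seed

# Made-up sample standing in for the flavor scores (an assumption)
flavor <- rnorm(500, mean = 7.5, sd = 0.35)

# Bootstrap distribution of the mean
resample_mean <- replicate(2000, mean(sample(flavor, replace = TRUE)))

# Quantile method: take the middle 95% of the bootstrap statistics
ci_quantile <- quantile(resample_mean, c(0.025, 0.975), names = FALSE)

# Standard error method: normal inverse CDF around the point estimate
ci_std_error <- qnorm(
  c(0.025, 0.975),
  mean = mean(resample_mean),
  sd = sd(resample_mean)
)

rbind(quantile = ci_quantile, std_error = ci_std_error)
```

The two rows agree closely whenever the bootstrap distribution is roughly bell-shaped. For a skewed bootstrap distribution they can diverge, and the quantile method makes no normality assumption.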

Let's practice!
Congratulations!

Recap
Chapter 1
Sampling basics
Selection bias
Pseudo-random sampling

Chapter 2
Simple random sampling
Systematic sampling
Stratified sampling
Cluster sampling

Chapter 3
Sample size and population parameters
Creating sampling distributions
Approximate vs. actual sampling dist'ns
Central limit theorem

Chapter 4
Bootstrapping from a single sample
Standard error
Confidence intervals

The most important things
The standard deviation of the sampling distribution (a.k.a. the standard error) of a statistic is
well-approximated by the standard deviation of the bootstrap distribution of a statistic.

When calculating confidence intervals, it's OK to assume that bootstrap distributions are
approximately normally distributed.
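
The first point can be checked by simulation when the population is known. The sketch below compares the standard deviation of a true sampling distribution with that of a bootstrap distribution built from one sample. The population is simulated, so every number here is an assumption for illustration rather than a course result.

```r
set.seed(99)  # arbitrary seed

# Made-up population (an assumption for illustration)
population <- rnorm(100000, mean = 7.5, sd = 0.35)
n <- 500

# Sampling distribution: means of many independent samples from the population
sampling_means <- replicate(1000, mean(sample(population, size = n)))

# Bootstrap distribution: means of many resamples of one single sample
one_sample <- sample(population, size = n)
boot_means <- replicate(1000, mean(sample(one_sample, replace = TRUE)))

c(
  sampling_distribution_sd  = sd(sampling_means),
  bootstrap_distribution_sd = sd(boot_means)
)
```

Both values should be near 0.35 / sqrt(500), about 0.016. The bootstrap estimate needs only one sample, which is why it is usable when the population is out of reach.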

What's next?
Analyzing Survey Data in R and Survey and Measurement Development in R
Experimental Design in R and A/B Testing in R

Foundations of Inference in R

Foundations of Probability in R and Fundamentals of Bayesian Data Analysis in R

Let's practice!