whSample helps analysts quickly generate statistical samples from Excel or Comma-Separated Values (CSV) files and write them to a new Excel workbook. Users can choose a Simple Random sample, a Stratified Random sample, or a Tabbed Stratified sample that places each stratum on its own worksheet.
See package vignettes for detailed documentation.
The workhorse function is sampler. A helper function, ssize, estimates the minimum sample size needed to meet statistical requirements, using a Normal Approximation to the Hypergeometric Distribution. This distribution describes the probabilities of yes/no-type outcomes when sampling without replacement. ssize takes four parameters:
- N, the population size.
- ci, the required confidence level. The default is 95%.
- me, the required level of precision, or margin of error. The default is +/- 7%.
- p, the anticipated rate of occurrence. The default is 50%.
ssize(N, ci=0.95, me=0.07, p=0.50) (shown with its defaults) requires only the N argument. On its own, it can be used to explore sample sizes under other conditions. For example, a probe sample may suggest that a 50-50 probability isn’t realistic. A revised sample size can then be estimated with the observed rate of occurrence (p=0.6, for example).
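The effect of changing p can be explored with a small stand-alone sketch. The function below, approx_ssize, is a hypothetical name and uses a common finite-population form of the normal approximation. It is an assumption that ssize computes something equivalent, and this is not code taken from the whSample source.

```r
# Sketch of a finite-population sample size estimate (normal approximation).
# Assumption: this resembles what ssize does; it is not the package's code.
approx_ssize <- function(N, ci = 0.95, me = 0.07, p = 0.50) {
  # Two-sided critical value for the requested confidence level
  z <- qnorm(1 - (1 - ci) / 2)
  # Variance term for a yes/no outcome, scaled by the critical value
  v <- z^2 * p * (1 - p)
  # The finite population correction keeps the estimate from exceeding N
  n <- (N * v) / (me^2 * (N - 1) + v)
  ceiling(n)
}

approx_ssize(5000)            # all defaults
approx_ssize(5000, p = 0.60)  # revised after a probe sample
```

Because p(1 - p) peaks at p = 0.5, the default is the most conservative choice. Any other rate of occurrence gives the same or a smaller estimate.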
The sampler function calls ssize to get its sample size estimate. It therefore accepts the same ci, me, and p arguments, with the same defaults, and passes them to ssize.
sampler also takes four additional arguments:
- irisData opens the file chooser to a folder with example files of Anderson’s Iris dataset of flower characteristics,
- backups provides a buffer of extra samples to replace any that are found to be invalid,
- seed is used to seed the internal random number generator, and
- keepOrg determines whether a copy of the population is included in the output.
The defaults for these additional arguments are backups=5, irisData=F, seed=NULL and keepOrg=F. With the default seed=NULL, sampler uses the current system time in milliseconds. Any number can be used as a seed. Whichever one is used is listed in the Report output tab. The keep-original option (keepOrg) defaults to FALSE, but can be set to keepOrg=T for smaller populations that won’t exceed Excel’s row limit of 1,048,576 rows.
To override any of these defaults, enter name=value as an argument.
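The roles of seed and backups can be illustrated with base R alone. This is a minimal sketch under stated assumptions: draw_with_backups is a hypothetical helper, not part of whSample, and the way it derives a seed from the system time only approximates the behavior described above.

```r
# Hypothetical helper: draw n sample rows plus a buffer of backup rows.
draw_with_backups <- function(data, n, backups = 5, seed = NULL) {
  # Assumption: with no seed supplied, derive one from the system time (ms)
  if (is.null(seed)) {
    seed <- as.integer((as.numeric(Sys.time()) * 1000) %% .Machine$integer.max)
  }
  set.seed(seed)
  # Never ask for more rows than the population holds
  total <- min(n + backups, nrow(data))
  picked <- data[sample(nrow(data), total), , drop = FALSE]
  # The first n rows are the sample; the remainder are backups
  picked$role <- ifelse(seq_len(total) <= n, "sample", "backup")
  # Return the seed as well so it can be recorded for reproducibility
  list(rows = picked, seed = seed)
}

first  <- draw_with_backups(iris, n = 10, backups = 2, seed = 12345)
second <- draw_with_backups(iris, n = 10, backups = 2, seed = 12345)
identical(first$rows, second$rows)  # same seed, same rows
```

Recording the seed alongside the sample is what makes a draw repeatable later, which is why the Report tab lists it.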
sampler uses a series of menus to guide users through the sampling process.
sampler creates a new Excel workbook with three parts:
- a copy of the original (source) data if previously requested,
- an Excel spreadsheet with the requested sample, and
- a new tab called Report with key reference information:
  - path and name of the source file
  - size (in rows) of the source file
  - sample type (Simple Random Sample, Stratified Random Sample, or Tabbed Stratified Sample)
  - sampling parameters
  - sample size
  - stratification key
  - number of strata
  - number of backups requested (this number is applied to every stratum in a stratified sample)
  - random number seed used, for documentation and reproducibility
  - date-time stamp of when the sample was generated
  - stratification information (name, number in the population, proportion of the population, and the number of samples)
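The stratification information in the Report (name, population count, proportion, and number of samples) can be reproduced in miniature with base R. The sketch below assumes proportional allocation rounded up within each stratum. Whether sampler allocates in exactly this way is an assumption, and the overall sample size of 30 is a hypothetical value chosen for illustration.

```r
# Assumption: proportional allocation, rounded up within each stratum.
pop <- iris[1:120, ]   # subset of the built-in iris data: strata of 50 / 50 / 20
n_total <- 30          # hypothetical overall sample size

# One row per stratum: name, population count, proportion, samples to draw
strata <- as.data.frame(table(pop$Species), stringsAsFactors = FALSE)
names(strata) <- c("stratum", "population")
strata$proportion <- strata$population / sum(strata$population)
strata$samples <- ceiling(strata$population * n_total / sum(strata$population))

# Draw the allotted number of rows from each stratum without replacement
set.seed(12345)
sampled <- do.call(rbind, lapply(seq_len(nrow(strata)), function(i) {
  rows <- which(pop$Species == strata$stratum[i])
  pop[rows[sample.int(length(rows), strata$samples[i])], , drop = FALSE]
}))

strata                  # per-stratum summary
table(sampled$Species)  # rows actually drawn per stratum
```

Rounding up within each stratum can push the total slightly above the overall target, which is one reason to report the per-stratum counts alongside the sample size.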
You can install whSample from CRAN with:
install.packages("whSample")
or get the latest development version with:
devtools::install_github("km4ivi/whSample")
sampler depends on several external packages to run properly. If you’re running a development version, make sure these packages are installed on your computer:
- tidyverse (or individually: magrittr, dplyr, purrr)
- openxlsx
- data.table
- tools
- utils
- tcltk
- bit64
ssize(5000): N=5000, other arguments use defaults
ssize(5000, p=0.60): N=5000, with a 60% expected rate of occurrence
sampler(): Uses all defaults, gets N from the source data.
sampler(backups=2, seed=12345): Overrides specific defaults