Different Types of Data - BioStatistics
Introduction to Statistics
What is Statistics
Most of you have probably thought about statistics, and maybe even used informal
statistical methods to help explain something you’ve experienced, or perhaps to win an
argument. Out of the number of students in the class (call it n), I would guess we’d have
n different definitions of Statistics, all of them somewhat similar, but none exactly the
same.
I would guess that maybe 80% of you would define Statistics as a Mathematics course.
I would also venture to say that maybe 40% of you have some anxiety about taking this
course because math isn’t your sharpest skill. Let me tell you now that Statistics does
use math, but I don’t call it a math course. If you’ve ever taken an Analytical Chemistry
course, or a Physics course, did you label it as a math course? Probably not. It was a
Chemistry course, or Physics, or whatever. I ask you to view this course, and the field of
Statistics, similarly to how you view those courses. Do they use math? Yes. Is that the
main focus? No.
I think of Statistics as the science of collecting, explaining and interpreting data. There is
a fine distinction between “explaining” and “interpreting.” I can use math to explain data,
like calculating an average, but what does that average mean for the dataset? For the
population? Taking the numbers and interpreting them is, to me, the most important part
of the field of Statistics. In readings and elsewhere, this is called inferential statistics.
Making inferences from data about populations is going to be the main focus of this
course. We’re going to use a software package called R to do math for us, which will
allow us to think solely about what results mean.
Where is Statistics used
In short, everywhere! While our focus in this course will be Statistics in the Biological
sciences, the tools you’ll gain here will apply to your everyday life. By the end of this
course, I hope you are able to watch the news, or read social media and start asking
questions about the statements being made.
Now that we have a better understanding of what we’re getting into, let’s begin where
any statistical study begins: the data.
There are many ways to gather data, and some are better than others. In this module,
we’ll discuss different types of experiments, some sampling methods, and a brief
introduction to some popular Experimental Designs in Statistics.
Experimental
Experimental studies are most widely used in scientific fields, like medicine or biology.
That is because experimental studies provide the researchers with the
most control. Researchers determine what their independent variable(s) are, and what
and how they will measure the dependent variable. Most of the examples we’ve talked
about so far have been experimental studies: think the diet and exercise example from
the last learning module. Researchers decided how many different types of exercises
and diets to use, as well as which types of those two explanatory variables. Subjects
used in the study will be assigned to the different levels of diet and exercise by the
researchers, thus providing the researchers with the most control.
There are many popular experimental studies, especially in the field of medicine. The
most popular is a double-blind study, where neither the researcher nor the subject knows
which level of the independent variable(s) they’ve been assigned to.
Because of the amount of control granted in experimental studies, we are able to make
inferences about the data. For me, inference is the most important thing we can do with
data. It’s when we’re able to take observations from a select group of subjects (our
sample), and make generalizations about an entire group of subjects, be it people,
frogs, or whatever.
Observational
Observational studies don’t provide the same amount of control for the researchers that
experimental studies do. In observational studies, researchers can determine
independent variable(s), but they just have to observe the dependent variable, not
knowing if the independent variable(s) caused the change in the dependent variable, or
if it was some other unmeasured variable.
Consider you want to measure purchases at a mall based on what people are wearing.
You have two levels of your independent variable (clothes): fancy clothes, non-fancy
clothes. You track people who fall into those two categories and see what they buy. One
possible inference drawn from this study would be that people in fancy clothes are more
likely to buy big-screen televisions than people dressed in non-fancy clothes. So do you
think that wearing fancy clothes will generally lead people to buy big-screen televisions?
Probably not; more likely, there is another unmeasured variable that is the actual cause (like
income).
Since we don’t have as much control, observational studies don’t provide inferences as
powerful as those from experimental studies. That is why researchers generally prefer
experimental studies over observational ones.
Surveys are very popular in the world, and I’m sure that all of you have filled out a
survey at some time in your life. While researchers designed those surveys, they have
no control over how you answer (i.e. just observing your responses). That makes
surveys observational studies. We should be wary of inferences made from surveys
because of this reason. That being said, there are situations where a survey is the only
way we’ll be able to get any data on a subject. If we were interested in, say, how well
marijuana helps with pain relief, we can’t set up an experimental study where we hit
somebody’s toe with a hammer, hand them a joint, and ask how they feel. A survey will
be the only way we can learn that response. So inferences can be made from
observational studies, like surveys, but I wouldn’t buy into the inferences as much as I
would from an experimental study.
Throughout this course we’ll be talking about populations, samples, and the difference
between the two. It will be important that we understand what both are, and that those
differences exist.
Parameters are values of interest of the population. These are usually some calculated
value, like an average, or a maximum. In Statistics, we usually symbolize these
parameters with Greek letters. We’ll talk more about which letters are used to represent
different parameters later.
These parameters can only be known if we have measurements from all subjects of the
population. If the parameter is known (and that can be an impossible “if”), then there is
no reason to do statistical analysis on the data. Our inference methods in statistics help
us make generalizations about those population parameters. If we already know them,
why go through the work?
We can think of instances, however, where it won’t be possible to measure the entire
population. In the frog example, to know a given parameter, we would need to measure
all the frogs in that species. How difficult would it be to find every single frog of that
species? To use another example to get this point across, what if we’re doing a study
on leaf weight for trees in a local park? Would you want to go out and weigh every single
leaf for all the trees in the park? I wouldn’t. That’s why taking samples will be important.
Samples are subsets of the population of interest. There are multiple ways samples can
be collected, some of which we’ll talk about in a bit. Samples will be important because
they will allow us to understand an entire population better.
Values that we get from samples are called statistics. These statistics will always be
known, since our sample data will be at hand. You may not be able to tell me the
population parameter for the average height of Doane students, but you can easily gather a
sample of students and calculate that average.
When we take samples, and calculate sample statistics, we can (and will) then use
those sample statistics to make inferences about the population parameters.
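We’ll use R for this kind of work later in the course, but the idea is easy to sketch in any language. Here is a hypothetical Python example (the population values, mean, and sample size are all made up for illustration) showing a sample statistic estimating an unknown population parameter:

```python
import random

# Hypothetical population: heights (in inches) of 1,000 students.
# In practice we almost never get to see the full population.
random.seed(42)
population = [random.gauss(67, 3) for _ in range(1000)]

# The parameter (population mean) -- normally unknowable.
mu = sum(population) / len(population)

# Take a simple random sample of 30 students and compute the statistic.
sample = random.sample(population, 30)
x_bar = sum(sample) / len(sample)

print(f"population mean (parameter): {mu:.2f}")
print(f"sample mean (statistic):     {x_bar:.2f}")
```

The sample mean won’t equal the population mean exactly, but it will typically land close to it, which is what makes inference possible.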
Since we’re using sample statistics to make inferences about those population
parameters, we need to make sure those statistics will represent the population. One of
the ways that won’t happen is when there is bias in the sample. In general, bias occurs
when a sample does not represent the population of interest. When bias happens, our
inferences suffer. We’re going to talk about three main types of bias.
Sampling Bias
A sampling bias occurs when the sample was not random, or a population was
undercovered (meaning that not every subject of the population had a chance of being
selected). If I was trying to determine the average height of Doane students, but only
sampled students on one campus, I have created a sampling bias because I didn’t
include the other 3 campuses.
Response Bias
Response bias occurs when the subject is influenced to respond a certain way. This can
easily be thought about in a survey setting, but it doesn’t just exist in observational
studies. Suppose I was running the frog temperature study from the last learning module,
and I noticed that the frogs in the 60 °F tank were shivering. Let’s say that I felt bad, so I
kicked up the temp a little. I’m now influencing the response to be inaccurate because
I’ve changed the study.
Nonresponse Bias
Nonresponse bias simply occurs when subjects don’t respond. If you’ve ever hung up
on a telemarketer, or deleted any emails with surveys in them, you may have contributed
some nonresponse bias. Nonresponse bias will only be a concern if a large proportion
of those sampled don’t respond, and that percentage will be dependent on the sample
size. If I asked the class if you prefer cookies or brownies, and I only had two students
respond, that sample probably won’t represent the entire class (so nonresponse bias
would be a huge issue). Voter turnout for national elections is usually around 56%, but
we don’t usually consider the President-Elect to be bogus because of nonresponse bias,
since 56% of eligible US voters is still a large number of people.
There are many ways to collect samples. We’ll talk about some of the more popular ones.
Simple Random Sample
A simple random sample (SRS) is when each subject of a population has an equally likely
chance of being selected. Pulling names out of a hat, or using a random number generator to
select subjects are ways of getting a SRS. These types of samples are preferred to non-random
samples, but it doesn’t mean that we will avoid bias. Consider a population of 20 people, 10
men and 10 women. If I randomly select 5 people, I could still end up with all five of the sampled
people being women, or all five being men. If you find yourself in a situation where that may be a
concern, use a different method.
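A minimal sketch of an SRS, shown in Python for illustration (the 20-person population matches the hypothetical example above):

```python
import random

# Hypothetical population of 20 people: 10 men and 10 women.
population = ["M"] * 10 + ["F"] * 10

random.seed(1)
srs = random.sample(population, 5)  # each person equally likely to be chosen

# The selection is random, but nothing guarantees balance --
# by chance, the sample could be all men or all women.
print(srs, "women in sample:", srs.count("F"))
```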
Stratified
In a stratified sample, we subset the population into groups, called strata. These strata are
mutually exclusive, and are determined based on some quality that is easily defined. At the high
school level, we can set up strata based on class standing (first-year, sophomore, junior, senior).
Once we have strata defined, to obtain the sample, we’ll randomly select a number of subjects
from each stratum (say 10 from each class standing). This will ensure you have representation from
each stratum.
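The stratified procedure can be sketched as follows; this is an illustrative Python example, and the roster names and stratum sizes are invented:

```python
import random

# Hypothetical roster: each student is tagged with a class standing (the stratum).
roster = {
    "first-year": [f"fy{i}" for i in range(120)],
    "sophomore":  [f"so{i}" for i in range(110)],
    "junior":     [f"jr{i}" for i in range(100)],
    "senior":     [f"sr{i}" for i in range(90)],
}

random.seed(7)
# Randomly select 10 students from each stratum, guaranteeing
# representation from every class standing.
stratified_sample = {
    stratum: random.sample(students, 10)
    for stratum, students in roster.items()
}

for stratum, picks in stratified_sample.items():
    print(stratum, len(picks))
```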
Cluster
A cluster sample is similar to a stratified sample in that we divide the population into groups, this
time called clusters. Clusters are usually based more on convenience, and subjects shouldn’t be
all of the same type. If I wanted to determine heights of Doane students just on the Crete campus, I
can set up clusters based on the 6 different living options (5 dorms, and off campus). With the
way things are set up, there are different age groups living in all areas, and we have no single-
sex dorms, so the students are pretty well randomized within the living spaces. To gather data,
we’ll then randomly select clusters, and measure all subjects in the clusters selected.
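Note the contrast with stratified sampling: there we sample a few subjects from every group, while here we pick a few whole groups and measure everyone inside them. A hypothetical Python sketch (the cluster names and sizes are made up):

```python
import random

# Hypothetical clusters: 6 living options, each with a list of residents.
clusters = {f"dorm_{i}": [f"d{i}_s{j}" for j in range(40)] for i in range(5)}
clusters["off_campus"] = [f"oc_s{j}" for j in range(60)]

random.seed(3)
# Randomly pick 2 whole clusters, then measure EVERY subject inside them.
chosen = random.sample(sorted(clusters), 2)
measured = [subject for name in chosen for subject in clusters[name]]

print("clusters chosen:", chosen, "subjects measured:", len(measured))
```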
Convenience
Convenience samples are the easiest to gather, but they are the sampling method that shouldn’t
be used. I sometimes call this the lazy sample because observations are collected by convenience.
Asking friends, coworkers, family to participate because they are close to you/easy for you to
ask most likely will not provide you with a good representation of the population.
There are some experiments that have built in structure to them. There are many
different types, some more complex than others, and experimental design is an entire
field of Statistics. Most of the designs are outside the scope of this class, but we’ll briefly
talk about some of the more frequent ones.
A good experimental design will try to capture all known sources of variation. This will
help us in understanding how well independent variables predict dependent variables.
Terminology
There is terminology from experiments that will help us understand design setups, and
will be important to this class moving forward.
The first thing we’ll talk about is treatment. A treatment is a level of an explanatory
variable. Think back to the frog example. We had one explanatory variable
(temperature), and we had three levels (60, 70, and 80 °F). The temperature values are
treatments, so there is one explanatory variable and three treatments. Treatments are what
the researcher applies to the subject.
Another term is experimental unit (e.u.). An experimental unit is the smallest possible
unit in which the treatment can be applied. If we’re dealing with people, we’d often think
that the experimental unit would be one person, and that’s usually the case. There are
instances, however, where the experimental unit will not be an individual subject.
Assume I’m testing what type of diet will increase weight in pigs, with 5 sties, 3 pigs per
sty. Also assume I’m testing 5 different food slops (five treatments). If I place the
different types of food in troughs (one trough for each sty), the smallest individual unit
that receives the treatment is all the pigs in a given sty. So I would have five
experimental units since I have five troughs.
Factorial
The last type of design we’ll talk about is a factorial design. A factorial design has more
than one independent variable (usually two, but can be more). The diet and exercise
example we’ve been working with so far in this course has been an example of a
factorial design. In factorial designs, we are not only interested in the two independent
variables individually, but also in how the two act together (we call this an interaction term).
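The defining feature of a factorial design is that every level of one variable is crossed with every level of the other. A short illustrative Python sketch (the diet and exercise level names are hypothetical stand-ins for the course example):

```python
from itertools import product

# Hypothetical levels for the diet-and-exercise example.
diets = ["low_carb", "low_fat", "control"]
exercises = ["cardio", "weights"]

# A full factorial design crosses every diet with every exercise,
# which is what lets us estimate the interaction between the two.
treatments = list(product(diets, exercises))

print(len(treatments), "treatment combinations")
for d, e in treatments:
    print(d, "+", e)
```

With 3 diet levels and 2 exercise levels, the design has 3 × 2 = 6 treatment combinations.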