Lab 3

The document discusses an introduction to data analysis using R. It describes loading baptism record data from the 1700s and exploring it by plotting variables and calculating ratios. It also discusses loading more recent US birth data and comparing trends. Key questions are answered about summarizing and visualizing data in R.

Uploaded by

Work Clothing
Copyright
© All Rights Reserved

Lab 3 : Introduction to data

Part 1 : Introduction to R


Enter the following command at the R prompt (i.e. right after > on the console). You can either type it in
manually or copy and paste it from this document.

source("http://www.openintro.org/stat/data/arbuthnot.R")

This command instructs R to access the OpenIntro website and fetch some data: the Arbuthnot baptism counts
for boys and girls. You should see that the workspace area in the upper righthand corner of the RStudio window
now lists a data set called arbuthnot that has 82 observations on 3 variables.

The Data: Dr. Arbuthnot’s Baptism Records


The Arbuthnot data set refers to Dr. John Arbuthnot, an 18th century physician, writer, and mathematician. He
was interested in the ratio of newborn boys to newborn girls, so he gathered the baptism records for children
born in London for every year from 1629 to 1710. We can take a look at the data by typing its name into the
console.

arbuthnot

What you should see are four columns of numbers, each row representing a different year: the first entry in each
row is simply the row number (an index we can use to access the data from individual years if we want), the
second is the year, and the third and fourth are the numbers of boys and girls baptized that year, respectively.
Use the scrollbar on the right side of the console window to examine the complete data set.
Note that the row numbers in the first column are not part of Arbuthnot’s data. R adds them as part of its printout
to help you make visual comparisons. You can think of them as the index that you see on the left side of a
spreadsheet. In fact, the comparison to a spreadsheet will generally be helpful. R has stored Arbuthnot’s data in a
kind of spreadsheet or table called a data frame.
You can see the dimensions of this data frame by typing:

dim(arbuthnot)
## [1] 82 3

This command should output [1] 82 3, indicating that there are 82 rows and 3 columns (we’ll get to what
the [1] means in a bit), just as it says next to the object in your workspace. You can see the names of these
columns (or variables) by typing:

names(arbuthnot)
## [1] "year" "boys" "girls"

You should see that the data frame contains the columns year, boys, and girls. At this point, you might notice
that many of the commands in R look a lot like functions from math class; that is, invoking R commands means
supplying a function with some number of arguments. The dim and names commands, for example, each took a
single argument, the name of a data frame.
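
To try dim and names without downloading anything, here is a minimal sketch on a small hand-made data frame. The object name toy is hypothetical, and only the 1629 row uses counts quoted in this lab; the other rows are invented for illustration.

```r
# Hypothetical stand-in for arbuthnot. Only the 1629 row uses counts
# quoted in this lab; the remaining rows are made up.
toy <- data.frame(
  year  = c(1629, 1630, 1631),
  boys  = c(5218, 4900, 4400),
  girls = c(4683, 4500, 4100)
)

dim(toy)    # number of rows, then number of columns
names(toy)  # the column (variable) names
```

Both functions take the data frame as their single argument, exactly as with arbuthnot.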
One advantage of RStudio is that it comes with a built-in data viewer. Click on the name arbuthnot in
the Environment pane (upper right window) that lists the objects in your workspace. This will bring up an
alternative display of the data set in the Data Viewer (upper left window). You can close the data viewer by
clicking on the x in the upper lefthand corner.

Some Exploration
Let’s start to examine the data a little more closely. We can access the data in a single column of a data frame
separately using a command like
arbuthnot$boys
This command will only show the number of boys baptized each year.

Q1. What command would you use to extract just the counts of girls baptized? Try it!
R has some powerful functions for making graphics. We can create a simple plot of the number of girls baptized
per year with the command

plot(x = arbuthnot$year, y = arbuthnot$girls)

By default, R creates a scatterplot with each x,y pair indicated by an open circle. The plot itself should appear
under the Plots tab of the lower right panel of RStudio. If we wanted to connect the data points with lines, we
could add a third argument, type = "l", where the letter l stands for line.

plot(x = arbuthnot$year, y = arbuthnot$girls, type = "l")

You might wonder how you are supposed to know that it was possible to add that third argument. Thankfully, R
documents all of its functions extensively. To read what a function does and learn the arguments that are
available to you, just type in a question mark followed by the name of the function that you’re interested in. Try
the following.

?plot

Q2. Is there an apparent trend in the number of girls baptized over the years? How would you describe it?
Now, suppose we want to plot the total number of baptisms. To compute this, we could use the fact that R is
really just a big calculator. We can type in mathematical expressions like

5218 + 4683

to see the total number of baptisms in 1629. We could repeat this once for each year, but there is a faster way. If
we add the vectors of baptism counts for boys and girls, R will compute all sums simultaneously.

arbuthnot$boys + arbuthnot$girls

What you will see are 82 numbers (in that packed display, because we aren’t looking at a data frame here), each
one representing the sum we’re after. Take a look at a few of them and verify that they are right. Therefore, we
can make a plot of the total number of baptisms per year with the command

plot(arbuthnot$year, arbuthnot$boys + arbuthnot$girls, type = "l")

This time, note that we left out the names of the first two arguments. We can do this because the help file shows
that the default for plot is for the first argument to be the x-variable and the second argument to be the y-
variable.
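
The element-by-element addition described above can be sketched on two short vectors. The names boys, girls and total are illustrative, and only the first value of each vector (the 1629 counts) comes from this lab; the rest are made up.

```r
# First entries are the 1629 counts quoted in the lab; the others are invented.
boys  <- c(5218, 4900, 4400)
girls <- c(4683, 4500, 4100)

# R adds the vectors position by position, so one expression gives every total.
total <- boys + girls
total
```

The first entry of total matches the result of typing 5218 + 4683 by hand.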
Similarly to how we computed the total number of baptisms, we can compute the ratio of the number of boys to the
number of girls baptized in 1629 with

5218 / 4683

or we can act on the complete vectors with the expression

arbuthnot$boys / arbuthnot$girls

The proportion of newborns that are boys in 1629 is

5218 / (5218 + 4683)

or this may also be computed for all years simultaneously:

arbuthnot$boys / (arbuthnot$boys + arbuthnot$girls)

Note that with R as with your calculator, you need to be conscious of the order of operations. Here, we want to
divide the number of boys by the total number of newborns, so we have to use parentheses. Without them, R will
first do the division, then the addition, giving you something that is not a proportion.
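
A small sketch of why the parentheses matter, using the 1629 counts from above. The variable names with_parens and without_parens are only for illustration.

```r
boys  <- 5218
girls <- 4683

# With parentheses: boys divided by the total, a value between 0 and 1.
with_parens <- boys / (boys + girls)

# Without parentheses: R evaluates boys / boys first (giving 1), then adds girls.
without_parens <- boys / boys + girls

with_parens
without_parens
```

The second result is 1 plus the number of girls, which is clearly not a proportion.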
Q3. Now, make a plot of the proportion of boys over time. What do you see?
Tip: If you use the up and down arrow keys, you can scroll through your previous commands, your so-called
command history. You can also access it by clicking on the history tab in the upper right panel. This will save
you a lot of typing in the future.
Finally, in addition to simple mathematical operators like subtraction and division, you can ask R to make
comparisons like greater than, >, less than, <, and equality, ==. For example, we can ask if boys outnumber girls
in each year with the expression

arbuthnot$boys > arbuthnot$girls

This command returns 82 values of either TRUE if that year had more boys than girls, or FALSE if that year did
not (the answer may surprise you). This output shows a different kind of data than we have considered so far. In
the arbuthnot data frame our values are numerical (the year, the number of boys and girls). Here, we’ve asked R
to create logical data, data where the values are either TRUE or FALSE. In general, data analysis will involve
many different kinds of data types, and one reason for using R is that it is able to represent and compute with
many of them.
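
To see logical data in action, here is a minimal sketch on made-up counts (not Arbuthnot's values, apart from the first pair). It also relies on the fact that R treats TRUE as 1 and FALSE as 0 in arithmetic, so sum counts the TRUE values.

```r
# Invented counts for three years; the second year is deliberately
# constructed so that girls outnumber boys.
boys  <- c(5218, 4900, 4400)
girls <- c(4683, 5000, 4100)

more_boys <- boys > girls   # one TRUE or FALSE per year
more_boys
class(more_boys)            # the data type of the result
sum(more_boys)              # number of years in which boys outnumber girls
```

Applying sum to the comparison on the real data frame is a quick way to check the answer without reading all 82 values.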
This seems like a fair bit for your first lab, so let’s stop here. To exit RStudio you can click the x in the upper
right corner of the whole window.
You will be prompted to save your workspace. If you click save, RStudio will save the history of your
commands and all the objects in your workspace so that the next time you launch RStudio, you will
see arbuthnot and you will have access to the commands you typed in your previous session. For now,
click save, then start up RStudio again.

Exercise 1
Q1. Load up the present day data with the following command.

source("http://www.openintro.org/stat/data/present.R")

The data are stored in a data frame called present.


Q2. What years are included in this data set? What are the dimensions of the data frame and what are the
variable or column names?
Q3. How do these counts compare to Arbuthnot’s? Are they on a similar scale?
Q4. Make a plot that displays the boy-to-girl ratio for every year in the data set. What do you see? Does
Arbuthnot’s observation about boys being born in greater proportion than girls hold up in the U.S.? Include the
plot in your response.
Q5. In what year did we see the largest total number of births in the U.S.?

Part 2 : Introduction to data


The Behavioral Risk Factor Surveillance System (BRFSS) is an annual telephone survey of 350,000 people in
the United States. As its name implies, the BRFSS is designed to identify risk factors in the adult population and
report emerging health trends. For example, respondents are asked about their diet and weekly physical activity,
their HIV/AIDS status, possible tobacco use, and even their level of healthcare coverage. The BRFSS Web site
(http://www.cdc.gov/brfss) contains a complete description of the survey, including the research questions that
motivate the study and many interesting results derived from the data.

We will focus on a random sample of 20,000 people from the BRFSS survey conducted in 2000. While there are
over 200 variables in this data set, we will work with a small subset.
We begin by loading the data set of 20,000 observations into the R workspace. After launching RStudio, enter
the following command.

source("http://www.openintro.org/stat/data/cdc.R")

The data set cdc that shows up in your workspace is a data matrix, with each row representing a case and each
column representing a variable. R calls this data format a data frame, which is a term that will be used throughout
the labs.
To view the names of the variables, type the command

names(cdc)

This returns the names genhlth, exerany, hlthplan, smoke100, height, weight, wtdesire, age, and gender. Each
one of these variables corresponds to a question that was asked in the survey. For example, for genhlth,
respondents were asked to evaluate their general health, responding either excellent, very good, good, fair or
poor. The exerany variable indicates whether the respondent exercised in the past month (1) or did not (0).
Likewise, hlthplan indicates whether the respondent had some form of health coverage (1) or did not (0).
The smoke100 variable indicates whether the respondent had smoked at least 100 cigarettes in their lifetime (1) or
had not (0). The other variables record the respondent’s height in inches, weight in pounds as well as their desired
weight, wtdesire, age in years, and gender.
Q1. How many cases are there in this data set? How many variables? For each variable, identify its data
type (e.g. categorical, discrete).
We can have a look at the first few entries (rows) of our data with the command

head(cdc)

and similarly we can look at the last few by typing

tail(cdc)

You could also look at all of the data frame at once by typing its name into the console, but that might be unwise
here. We know cdc has 20,000 rows, so viewing the entire data set would mean flooding your screen. It’s better
to take small peeks at the data with head, tail or the subsetting techniques that you’ll learn in a moment.
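
As a quick sketch of these peeking functions, the example below builds a made-up 20-row data frame (the name toy and its columns are hypothetical). By default head and tail show six rows, and the n argument changes that.

```r
# Invented data: 20 respondents with evenly spaced weights.
toy <- data.frame(id = 1:20, weight = seq(120, 215, by = 5))

head(toy)         # first six rows by default
tail(toy, n = 3)  # last three rows only
```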

Summaries and tables


The BRFSS questionnaire is a massive trove of information. A good first step in any analysis is to distill all of
that information into a few summary statistics and graphics. As a simple example, the function summary returns
a numerical summary: minimum, first quartile, median, mean, third quartile, and maximum. For weight this is

summary(cdc$weight)

R also functions like a very fancy calculator. If you wanted to compute the interquartile range for the
respondents’ weight, you would look at the output from the summary command above and then enter

190 - 140

R also has built-in functions to compute summary statistics one by one. For instance, to calculate the mean,
median, and variance of weight, type

mean(cdc$weight)
var(cdc$weight)
median(cdc$weight)
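
The interquartile range can also be computed without copying numbers from the summary output. The sketch below uses a made-up vector of weights (w is a hypothetical name) together with the base R functions quantile and IQR.

```r
# Seven invented weights, in pounds.
w <- c(120, 140, 150, 165, 180, 190, 230)

summary(w)                       # min, quartiles, median, mean, max

q <- quantile(w, c(0.25, 0.75))  # first and third quartiles
unname(q[2] - q[1])              # third quartile minus first quartile
IQR(w)                           # the same value, computed directly
```

Subtracting the quartiles by hand, as in 190 - 140 above, and calling IQR should agree whenever both are applied to the same data.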

While it makes sense to describe a quantitative variable like weight in terms of these statistics, what about
categorical data? We would instead consider the sample frequency or relative frequency distribution. The
function table does this for you by counting the number of times each kind of response was given. For example,
to see the number of people who have smoked 100 cigarettes in their lifetime, type

table(cdc$smoke100)

or instead look at the relative frequency distribution by typing

table(cdc$smoke100)/20000

Notice how R automatically divides all entries in the table by 20,000 in the command above. This is similar to
something we observed in the Introduction to R; when we multiplied or divided a vector with a number, R
applied that action across entries in the vectors. As we see above, this also works for tables. Next, we make a bar
plot of the entries in the table by putting the table inside the barplot command.

barplot(table(cdc$smoke100))

Notice what we’ve done here! We’ve computed the table of cdc$smoke100 and then immediately applied the
graphical function, barplot. This is an important idea: R commands can be nested. You could also break this into
two steps by typing the following:

smoke <- table(cdc$smoke100)

barplot(smoke)

Here, we’ve made a new object, a table, called smoke (the contents of which we can see by typing smoke into
the console) and then used it as the input for barplot. The special symbol <- performs an assignment, taking
the output of one line of code and saving it into an object in your workspace. This is another important idea that
we’ll return to later.
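
Here is a minimal sketch of the same table workflow on a made-up vector of eight responses. The names smoke100, counts and rel_freq are illustrative. Dividing by length(smoke100) rather than a hard-coded 20000 is one possible variation, so the code still works if the sample size changes.

```r
# Eight invented responses: 1 = has smoked at least 100 cigarettes, 0 = has not.
smoke100 <- c(1, 0, 0, 1, 1, 0, 0, 0)

counts <- table(smoke100)               # assignment: save the table as an object
counts

rel_freq <- counts / length(smoke100)   # division is applied to every entry
rel_freq
```

The relative frequencies always add up to 1, which is a handy sanity check.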
Q2. Create a numerical summary for height and age, and compute the interquartile range for each.
Compute the relative frequency distribution for gender and exerany. How many males are in the sample?
What proportion of the sample reports being in excellent health?
The table command can be used to tabulate any number of variables that you provide. For example, to examine
which participants have smoked across each gender, we could use the following.

table(cdc$gender,cdc$smoke100)

Here, we see column labels of 0 and 1. Recall that 1 indicates a respondent has smoked at least 100 cigarettes.
The rows refer to gender. To create a mosaic plot of this table, we would enter the following command.

mosaicplot(table(cdc$gender,cdc$smoke100))

We could have accomplished this in two steps by saving the table in one line and applying mosaicplot in the next
(see the table/barplot example above).
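
A two-way table can be explored on a tiny made-up sample before trying it on cdc. In this sketch the vectors gender and smoke100 and the object tab are invented for illustration; individual cells are picked out by their row and column labels.

```r
# Six invented respondents.
gender   <- c("m", "f", "f", "m", "m", "f")
smoke100 <- c(1, 0, 1, 1, 0, 0)

tab <- table(gender, smoke100)  # rows are gender, columns are smoke100
tab

tab["m", "1"]  # number of men who have smoked at least 100 cigarettes
```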
Q3. What does the mosaic plot reveal about smoking habits and gender?

Interlude: How R thinks about data


We mentioned that R stores data in data frames, which you might think of as a type of spreadsheet. Each row is a
different observation (a different respondent) and each column is a different variable (the first is genhlth, the
second exerany and so on). We can see the size of the data frame next to the object name in the workspace or we
can type

dim(cdc)

which will return the number of rows and columns. Now, if we want to access a subset of the full data frame, we
can use row-and-column notation. For example, to see the sixth variable of the 567th respondent, use the format

cdc[567,6]

which means we want the element of our data set that is in the 567th row (meaning the 567th person or
observation) and the 6th column (in this case, weight). We know that weight is the 6th variable because it is the
6th entry in the list of variable names

names(cdc)

To see the weights for the first 10 respondents we can type

cdc[1:10,6]

In this expression, we have asked just for rows in the range 1 through 10. R uses the : to create a range of values,
so 1:10 expands to 1, 2, 3, 4, 5, 6, 7, 8, 9, 10. You can see this by entering

1:10

Finally, if we want all of the data for the first 10 respondents, type

cdc[1:10,]

By leaving out an index or a range (we didn’t type anything between the comma and the square bracket), we get
all the columns. When starting out in R, this is a bit counterintuitive. As a rule, we omit the column number to
see all columns in a data frame. Similarly, if we leave out an index or range for the rows, we would access all the
observations, not just the 567th, or rows 1 through 10. Try the following to see the weights for all 20,000
respondents fly by on your screen

cdc[,6]

Recall that column 6 represents respondents’ weight, so the command above reported all of the weights in the
data set. An alternative method to access the weight data is by referring to the name. Previously, we
typed names(cdc) to see all the variables contained in the cdc data set. We can use any of the variable names to
select items in our data set.

cdc$weight

The dollar-sign tells R to look in data frame cdc for the column called weight. Since that’s a single vector, we
can subset it with just a single index inside square brackets. We see the weight for the 567th respondent by
typing

cdc$weight[567]

Similarly, for just the first 10 respondents

cdc$weight[1:10]

The command above returns the same result as the cdc[1:10,6] command. Both row-and-column notation and
dollar-sign notation are widely used; which one you choose depends on your personal preference.
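
The two notations can be compared side by side on a small made-up data frame. The name toy and its values are hypothetical; weight is placed in column 2 here, not column 6 as in cdc.

```r
# Five invented respondents.
toy <- data.frame(
  height = c(70, 64, 60, 66, 61),
  weight = c(175, 125, 105, 132, 150)
)

toy[3, 2]        # row 3, column 2: one weight
toy[1:3, 2]      # rows 1 to 3 of column 2
toy[1:3, ]       # rows 1 to 3, all columns
toy$weight[1:3]  # the same three weights, selected by column name

identical(toy[1:3, 2], toy$weight[1:3])  # both notations agree
```

Selecting by name has the practical advantage that the code keeps working if the column order changes.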

A little more on subsetting


It’s often useful to extract all individuals (cases) in a data set that have specific characteristics. We accomplish
this through conditioning commands. First, consider expressions like

cdc$gender == "m"

or

cdc$age > 30

These commands produce a series of TRUE and FALSE values. There is one value for each respondent,
where TRUE indicates that the person was male (via the first command) or older than 30 (second command).

Suppose we want to extract just the data for the men in the sample, or just for those over 30. We can use the R
function subset to do that for us. For example, the command

mdata <- subset(cdc, cdc$gender == "m")

will create a new data set called mdata that contains only the men from the cdc data set. In addition to finding it
in your workspace alongside its dimensions, you can take a peek at the first several rows as usual

head(mdata)

This new data set contains all the same variables but just under half the rows. It is also possible to tell R to keep
only specific variables, which is a topic we’ll discuss in a future lab. For now, the important thing is that we can
carve up the data based on values of one or more variables.
As an aside, you can use several of these conditions together with & and |. The & is read “and” so that

m_and_over30 <- subset(cdc, gender == "m" & age > 30)

will give you the data for men over the age of 30. The | character is read “or” so that

m_or_over30 <- subset(cdc, gender == "m" | age > 30)

will take people who are men or over the age of 30 (why that’s an interesting group is hard to say, but right now
the mechanics of this are the important thing). In principle, you may use as many “and” and “or” clauses as you
like when forming a subset.
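
To check how & and | differ, the following sketch applies both to a made-up data frame of five respondents (toy is a hypothetical name; the column names mirror those in cdc).

```r
# Five invented respondents.
toy <- data.frame(
  gender = c("m", "f", "m", "f", "m"),
  age    = c(25, 40, 35, 28, 50)
)

# Both conditions must hold: only the men aged 35 and 50 qualify.
m_and_over30 <- subset(toy, gender == "m" & age > 30)

# Either condition may hold: everyone except the 28-year-old woman qualifies.
m_or_over30 <- subset(toy, gender == "m" | age > 30)

nrow(m_and_over30)
nrow(m_or_over30)
```

An "or" subset is never smaller than the corresponding "and" subset, since every row that satisfies both conditions also satisfies at least one.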
Q3. Create a new object called under23_and_smoke that contains all observations of respondents under
the age of 23 that have smoked 100 cigarettes in their lifetime. Write the command you used to create the
new object as the answer to this exercise.

Quantitative data
With our subsetting tools in hand, we’ll now return to the task of the day: making basic summaries of the BRFSS
questionnaire. We’ve already looked at categorical data such as smoke100 and gender so now let’s turn our attention
to quantitative data. Two common ways to visualize quantitative data are with box plots and histograms. We can
construct a box plot for a single variable with the following command.

boxplot(cdc$height)

You can compare the locations of the components of the box by examining the summary statistics.

summary(cdc$height)

Confirm that the median and upper and lower quartiles reported in the numerical summary match those in the
graph. The purpose of a boxplot is to provide a thumbnail sketch of a variable for the purpose of comparing
across several categories. So we can, for example, compare the heights of men and women with

boxplot(cdc$height ~ cdc$gender)

The notation here is new. The ~ character can be read versus or as a function of. So we’re asking R to give us
box plots of heights where the groups are defined by gender.
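
The grouping done by ~ can be inspected numerically. This sketch uses a made-up data frame (toy is hypothetical) and passes plot = FALSE, an argument of boxplot that returns the computed box statistics as a list instead of drawing the figure. The data argument, an alternative to writing toy$height ~ toy$gender, is used here for brevity.

```r
# Ten invented respondents: five women and five men.
toy <- data.frame(
  height = c(60, 62, 64, 66, 68, 66, 68, 70, 72, 74),
  gender = rep(c("f", "m"), each = 5)
)

b <- boxplot(height ~ gender, data = toy, plot = FALSE)

b$names       # one group per gender
b$stats       # one column per group; the middle row holds the medians
b$stats[3, ]  # median height for each group
```

Dropping plot = FALSE draws the side-by-side box plots for the same groups.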
Next let’s consider a new variable that doesn’t show up directly in this data set: Body Mass Index (BMI)
(http://en.wikipedia.org/wiki/Body_mass_index). BMI is a weight to height ratio and can be calculated as:
$$\mathrm{BMI} = \frac{\text{weight (lb)}}{\text{height (in)}^2} \times 703$$
703 is the approximate conversion factor to change units from metric (meters and kilograms) to imperial (inches
and pounds).
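
Where the 703 comes from can be checked with a little arithmetic. The sketch below assumes the standard conversions of about 2.20462 pounds per kilogram and 39.3701 inches per meter, and uses one made-up person (175 lb, 70 in); all variable names are illustrative.

```r
lb_per_kg <- 2.20462   # pounds in one kilogram (approximate)
in_per_m  <- 39.3701   # inches in one meter (approximate)

# Metric BMI is kg / m^2. Substituting kg = lb / lb_per_kg and
# m = in / in_per_m leaves this factor multiplying lb / in^2.
conv <- in_per_m^2 / lb_per_kg
conv                    # close to 703

# One invented person, computed both ways.
weight_lb <- 175
height_in <- 70
bmi_imperial <- weight_lb / height_in^2 * 703
bmi_metric   <- (weight_lb / lb_per_kg) / (height_in / in_per_m)^2
c(bmi_imperial, bmi_metric)
```

The two BMI values differ only slightly, because 703 is a rounded version of the conversion factor.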
The following two lines first make a new object called bmi and then create box plots of these values, defining
groups by the variable cdc$genhlth.

bmi <- (cdc$weight / cdc$height^2) * 703
boxplot(bmi ~ cdc$genhlth)

Notice that the first line above is just some arithmetic, but it’s applied to all 20,000 numbers in the cdc data set.
That is, for each of the 20,000 participants, we take their weight, divide by their height-squared and then
multiply by 703. The result is 20,000 BMI values, one for each respondent. This is one reason why we like R: it
lets us perform computations like this using very simple expressions.
Q4. What does this box plot show? Pick another categorical variable from the data set and see how it relates to
BMI. List the variable you chose, why you might think it would have a relationship to BMI, and indicate what
the figure seems to suggest.
Finally, let’s make some histograms. We can look at the histogram for the age of our respondents with the
command

hist(cdc$age)

Histograms are generally a very good way to see the shape of a single distribution, but that shape can change
depending on how the data is split between the different bins. You can control the number of bins by adding an
argument to the command. In the next two lines, we first make a default histogram of bmi and then one with 50
breaks.

hist(bmi)
hist(bmi, breaks = 50)
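
The effect of breaks can also be examined without looking at the pictures. This sketch uses made-up data (the integers 1 to 100, stored in a hypothetical vector x) and plot = FALSE, which makes hist return the bin information instead of drawing. Note that R treats a single number given to breaks as a suggestion and picks convenient round break points near it.

```r
x <- 1:100  # invented data

h_default <- hist(x, plot = FALSE)               # R chooses the bins
h_50      <- hist(x, breaks = 50, plot = FALSE)  # ask for roughly 50 bins

length(h_default$counts)  # number of bins by default
length(h_50$counts)       # number of bins after the request
sum(h_50$counts)          # every observation falls in exactly one bin
```

More bins show finer detail in the shape, at the cost of a noisier picture when the sample is small.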

Note that you can flip between plots that you’ve created by clicking the forward and backward arrows in the
lower right region of RStudio, just above the plots. How do these two histograms compare?
At this point, we’ve done a good first pass at analyzing the information in the BRFSS questionnaire. We’ve
found an interesting association between smoking and gender, and we can say something about the relationship
between people’s assessment of their general health and their own BMI. We’ve also picked up essential
computing tools – summary statistics, subsetting, and plots – that will serve us well throughout this course.

Exercise 2
Q1. Make a scatterplot of weight versus desired weight. Describe the relationship between these two variables.
Q2. Let’s consider a new variable: the difference between desired weight (wtdesire) and current weight (weight).
Create this new variable by subtracting the two columns in the data frame and assigning them to a new object
called wdiff.
Q3. What type of data is wdiff? If an observation wdiff is 0, what does this mean about the person’s weight and
desired weight? What if wdiff is positive or negative?
Q4. Describe the distribution of wdiff in terms of its center, shape, and spread, including any plots you use. What
does this tell us about how people feel about their current weight?
Q5. Using numerical summaries and a side-by-side box plot, determine if men tend to view their weight
differently than women.
Q6. Now it’s time to get creative. Find the mean and standard deviation of weight and determine what proportion
of the weights are within one standard deviation of the mean.
