Data Manipulation With Dplyr
Our data
Reading in data
The dplyr package
dplyr verbs
filter()
select()
mutate()
arrange()
summarize()
group_by()
The pipe: %>%
How %>% works
Nesting versus %>%
Piping exercises
Homework
Review
Our data
We’re going to use the yeast gene expression dataset described in the data
frames lesson (r-dataframes.html#our-data). This is a cleaned up version of
a gene expression dataset from Brauer et al. Coordination of Growth Rate,
Cell Cycle, Stress Response, and Metabolic Activity in Yeast (2008) Mol Biol
Cell 19:352-367 (http://www.ncbi.nlm.nih.gov/pubmed/17959824). This data
is from a gene expression microarray, and in this paper the authors are
examining the relationship between growth rate and gene expression in
yeast cultures limited by one of six different nutrients (glucose, leucine,
ammonium, sulfate, phosphate, uracil). If you give yeast a rich medium loaded
with nutrients but restrict the supply of a single nutrient, you can control
the growth rate to any rate you choose. By starving yeast of specific
nutrients you can find genes that:
You can download the cleaned up version of the data at the link above
(data.html). The file is called brauer2007_tidy.csv
(data/brauer2007_tidy.csv). Later on we’ll actually start with the original raw
data (minimally processed) and manipulate it so that we can make it more
amenable for analysis.
Reading in data
We need to load both the dplyr and readr packages for efficiently reading in
and displaying this data. We’re also going to use many other functions from
the dplyr package. Make sure you have these packages installed as
described on the setup page (setup.html).
# Load packages
library(readr)
library(dplyr)
# Read in data
ydat <- read_csv(file="data/brauer2007_tidy.csv")
When you read in data with the readr package ( read_csv() ), the data frame
takes on a “special” class of data frame called a tbl (pronounced “tibble”),
which you can see with class(ydat) . If you have other “regular” data frames
in your workspace, the as_tibble() function will convert them into this
special tbl that displays nicely (e.g.: iris <- as_tibble(iris) ). You don’t
have to turn all your data frame objects into tibbles, but it does make
working with large datasets a bit easier.
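As a minimal sketch of that conversion (using the built-in iris data frame; the object name iris_tbl is just an illustrative choice), you can compare the class before and after:

```r
library(dplyr)

# iris is a regular data frame that ships with R
class(iris)

# Convert a copy to a tibble; the original iris object is left untouched
iris_tbl <- as_tibble(iris)
class(iris_tbl)

# Printing a tibble shows only the first rows plus the column types
iris_tbl
```

Assigning to a new name rather than overwriting iris keeps the built-in object available for comparison.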
You can read more about tibbles in the Tibbles chapter in R for Data Science
(http://r4ds.had.co.nz/tibbles.html) or in the tibbles vignette
(https://cran.r-project.org/web/packages/tibble/vignettes/tibble.html). They
keep most of the features of data frames, and drop the features that used to
be convenient but are now frustrating (e.g., converting character vectors to
factors). You can read more about the differences between data frames and
tibbles in this section of the tibbles vignette
(https://cran.r-project.org/web/packages/tibble/vignettes/tibble.html#tibbles-vs-data-frames),
but the major convenience for us concerns printing (i.e., displaying) a
tibble to the screen. When you print a tibble, it only shows the first 10
rows and all the columns that fit on one screen. It also prints an
abbreviated description of the column type. You can control the default
appearance with options:
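Here is a hedged sketch of two ways to control tibble printing. The n and width arguments to print() apply to a single call. The option names tibble.print_max, tibble.print_min and tibble.width are the ones documented by the tibble package at the time this lesson was written; newer releases may prefer the equivalent pillar.* names, so treat them as an assumption and check the documentation for your installed version. The object name big is illustrative.

```r
library(dplyr)

big <- as_tibble(mtcars)

# One-off control: show 15 rows and every column for this call only
print(big, n = 15, width = Inf)

# Session-wide defaults (assumed option names, see the note above).
# options() returns the previous values so they can be restored later.
old <- options(tibble.print_max = 20, tibble.print_min = 5, tibble.width = Inf)
big
options(old)  # restore the previous settings
```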
dplyr verbs
The dplyr package gives you a handful of useful verbs for managing data.
On their own they don’t do anything that base R can’t do. Here are some of
the single-table verbs we’ll be working with in this lesson (single-table
meaning that they only work on a single table – contrast that to two-table
verbs used for joining data together, which we’ll cover in a later lesson).
1. filter()
2. select()
3. mutate()
4. arrange()
5. summarize()
6. group_by()
They all take a data frame or tibble as their input for the first argument, and
they all return a data frame or tibble as output.
filter()
If you want to filter rows of the data where some condition is true, use the
filter() function.
1. The first argument is the data frame you want to filter, e.g.
filter(mydata, ...).
2. The second argument is a condition each row must satisfy, e.g.
filter(ydat, symbol == "LEU1"). If you want to satisfy all of multiple
conditions, you can use the “and” operator, & . The “or” operator | (the
pipe character, usually shift-backslash) will return a subset that meets
any of the conditions.
== : Equal to
!= : Not equal to
> , >= : Greater than, greater than or equal to
< , <= : Less than, less than or equal to
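To see how these operators combine, here is a small sketch on a toy tibble. The object toy and its values are invented for illustration (they only borrow the column names of ydat), so the results say nothing about the real data.

```r
library(dplyr)

# Toy data with the same column names as ydat (values are invented)
toy <- tibble(
  symbol     = c("LEU1", "LEU1", "ADH2", "ADH2"),
  nutrient   = c("Leucine", "Glucose", "Leucine", "Glucose"),
  rate       = c(0.05, 0.30, 0.05, 0.30),
  expression = c(3.8, 0.1, 4.1, -0.5)
)

# Both conditions must hold (&)
both <- filter(toy, symbol == "LEU1" & rate < 0.1)

# Either condition may hold (|)
either <- filter(toy, symbol == "LEU1" | expression >= 4)

# Not equal to
not_glucose <- filter(toy, nutrient != "Glucose")

both
either
not_glucose
```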
Let’s try it out. For this to work you have to have already loaded the dplyr
package. Let’s take a look at LEU1
(http://www.yeastgenome.org/locus/Leu1/overview), a gene involved in
leucine synthesis.
## # A tibble: 6 x 7
##   symbol systematic_name nutrient   rate expression bp
##   <chr>  <chr>           <chr>     <dbl>      <dbl> <chr>
## 1 LEU1   YGL009C         Glucose     0.3       0.03 leucine biosynthesis
## 2 LEU1   YGL009C         Ammonia     0.3      -0.22 leucine biosynthesis
## 3 LEU1   YGL009C         Phosphate   0.3      -0.07 leucine biosynthesis
## 4 LEU1   YGL009C         Sulfate     0.3      -0.76 leucine biosynthesis
## 5 LEU1   YGL009C         Leucine     0.3       0.87 leucine biosynthesis
## 6 LEU1   YGL009C         Uracil      0.3      -0.16 leucine biosynthesis
## # ... with 1 more variables: mf <chr>
# Show only stats for LEU1 and Leucine depletion.
# LEU1 expression starts off high and drops
filter(ydat, symbol=="LEU1" & nutrient=="Leucine")
## # A tibble: 6 x 7
##   symbol systematic_name nutrient  rate expression bp
##   <chr>  <chr>           <chr>    <dbl>      <dbl> <chr>
## 1 LEU1   YGL009C         Leucine   0.05       3.84 leucine biosynthesis
## 2 LEU1   YGL009C         Leucine   0.10       3.36 leucine biosynthesis
## 3 LEU1   YGL009C         Leucine   0.15       3.24 leucine biosynthesis
## 4 LEU1   YGL009C         Leucine   0.20       2.84 leucine biosynthesis
## 5 LEU1   YGL009C         Leucine   0.25       2.04 leucine biosynthesis
## 6 LEU1   YGL009C         Leucine   0.30       0.87 leucine biosynthesis
## # ... with 1 more variables: mf <chr>
Let’s look at this graphically. Don’t worry about what these commands are
doing just yet - we’ll cover that later on when we talk about ggplot2. Here
I’m taking the filtered dataset containing just expression estimates for LEU1
where I have 36 rows (one for each of 6 nutrients × 6 growth rates), and I’m
piping that dataset to the plotting function, where I’m plotting rate on the
x-axis, expression on the y-axis, mapping the value of nutrient to the color,
and using a line plot to display the data.
library(ggplot2)
filter(ydat, symbol=="LEU1") %>%
ggplot(aes(rate, expression, colour=nutrient)) + geom_line(lwd=1.5)
Look closely at that! LEU1 is highly expressed when starved of leucine
because the cell has to synthesize its own! And as the amount of leucine in
the environment (the growth rate) increases, the cell can worry less about
synthesizing leucine, so LEU1 expression goes back down. Consequently the
cell can devote more energy to other functions, and we see other genes’
expression rising very slightly.
EXERCISE 1
1. Display the data where the gene ontology biological process (the bp
variable) is “leucine biosynthesis” (case-sensitive) and the limiting
nutrient was Leucine. (Answer should return a 24-by-7 data frame – 4
genes × 6 growth rates).
2. Which gene/rate combinations had high expression (in the top 1% of
expressed genes)? Hint: see ?quantile and try
quantile(ydat$expression, probs=.99) to see the expression value
which is higher than 99% of all the data, then filter() based on that.
Try wrapping your answer with a View() function so you can see the
whole thing. What does it look like those genes are doing? Answer
should return a 1971-by-7 data frame.
1. The ydat object in our workspace is not being modified directly. That
is, we can filter(ydat, ...) , and a result is returned to the screen,
but ydat remains the same. This effect is similar to what we
demonstrated in our first session.
# Weight is still 50
weight
Note that this is different from saving your entire workspace to an Rdata file,
which would contain all the objects we’ve created (weight, ydat, leudat, etc).
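The point that filter() returns a new result rather than changing its input can be checked with a short sketch. The objects toy and leu_only are hypothetical names used only for this illustration.

```r
library(dplyr)

toy <- tibble(
  symbol = c("LEU1", "LEU1", "ADH2"),
  rate   = c(0.05, 0.30, 0.05)
)

# The filtered result is printed, but toy itself is not changed
filter(toy, symbol == "LEU1")
nrow(toy)

# Assign the result to a new object if you want to keep it
leu_only <- filter(toy, symbol == "LEU1")
nrow(leu_only)
```

Saving the result under a new name leaves the full table available for later steps.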
select()
The filter() function allows you to return only certain rows matching a
condition. The select() function returns only certain columns. The first
argument is the data, and subsequent arguments are the columns you
want.
## # A tibble: 198,430 x 2
## symbol systematic_name
## <chr> <chr>
## 1 SFB2 YNL049C
## 2 <NA> YNL095C
## 3 QRI7 YDL104C
## 4 CFT2 YLR115W
## 5 SSO2 YMR183C
## 6 PSP2 YML017W
## 7 RIB2 YOL066C
## 8 VMA13 YPR036W
## 9 EDC3 YEL015W
## 10 VPS5 YOR069W
## # ... with 198,420 more rows
# Alternatively, just remove columns. Remove the bp and mf columns.
select(ydat, -bp, -mf)
## # A tibble: 198,430 x 5
## symbol systematic_name nutrient rate expression
## <chr> <chr> <chr> <dbl> <dbl>
## 1 SFB2 YNL049C Glucose 0.05 -0.24
## 2 <NA> YNL095C Glucose 0.05 0.28
## 3 QRI7 YDL104C Glucose 0.05 -0.02
## 4 CFT2 YLR115W Glucose 0.05 -0.33
## 5 SSO2 YMR183C Glucose 0.05 0.05
## 6 PSP2 YML017W Glucose 0.05 -0.69
## 7 RIB2 YOL066C Glucose 0.05 -0.55
## 8 VMA13 YPR036W Glucose 0.05 -0.75
## 9 EDC3 YEL015W Glucose 0.05 -0.24
## 10 VPS5 YOR069W Glucose 0.05 -0.16
## # ... with 198,420 more rows
## # A tibble: 198,430 x 7
## symbol systematic_name nutrient rate expression
## <chr> <chr> <chr> <dbl> <dbl>
## 1 SFB2 YNL049C Glucose 0.05 -0.24
## 2 <NA> YNL095C Glucose 0.05 0.28
## 3 QRI7 YDL104C Glucose 0.05 -0.02
## 4 CFT2 YLR115W Glucose 0.05 -0.33
## 5 SSO2 YMR183C Glucose 0.05 0.05
## 6 PSP2 YML017W Glucose 0.05 -0.69
## 7 RIB2 YOL066C Glucose 0.05 -0.55
## 8 VMA13 YPR036W Glucose 0.05 -0.75
## 9 EDC3 YEL015W Glucose 0.05 -0.24
## 10 VPS5 YOR069W Glucose 0.05 -0.16
## # ... with 198,420 more rows, and 2 more variables: bp <chr>, mf <chr>
Notice above how the original data doesn’t change. We’re selecting out only
certain columns of interest and throwing away columns we don’t care
about. If we wanted to keep this data, we would need to reassign the result
of the select() operation to a new object. Let’s make a new object called
nogo that does not contain the GO annotations. Notice again how the
original data is unchanged.
## # A tibble: 198,430 x 5
## symbol systematic_name nutrient rate expression
## <chr> <chr> <chr> <dbl> <dbl>
## 1 SFB2 YNL049C Glucose 0.05 -0.24
## 2 <NA> YNL095C Glucose 0.05 0.28
## 3 QRI7 YDL104C Glucose 0.05 -0.02
## 4 CFT2 YLR115W Glucose 0.05 -0.33
## 5 SSO2 YMR183C Glucose 0.05 0.05
## 6 PSP2 YML017W Glucose 0.05 -0.69
## 7 RIB2 YOL066C Glucose 0.05 -0.55
## 8 VMA13 YPR036W Glucose 0.05 -0.75
## 9 EDC3 YEL015W Glucose 0.05 -0.24
## 10 VPS5 YOR069W Glucose 0.05 -0.16
## # ... with 198,420 more rows
## # A tibble: 6 x 5
## symbol systematic_name nutrient rate expression
## <chr> <chr> <chr> <dbl> <dbl>
## 1 LEU1 YGL009C Glucose 0.05 -1.12
## 2 LEU1 YGL009C Ammonia 0.05 -0.76
## 3 LEU1 YGL009C Phosphate 0.05 -0.81
## 4 LEU1 YGL009C Sulfate 0.05 -1.57
## 5 LEU1 YGL009C Leucine 0.05 3.84
## 6 LEU1 YGL009C Uracil 0.05 -2.07
# Notice how the original data is unchanged - still have all 7 columns
ydat
## # A tibble: 198,430 x 7
## symbol systematic_name nutrient rate expression
## <chr> <chr> <chr> <dbl> <dbl>
## 1 SFB2 YNL049C Glucose 0.05 -0.24
## 2 <NA> YNL095C Glucose 0.05 0.28
## 3 QRI7 YDL104C Glucose 0.05 -0.02
## 4 CFT2 YLR115W Glucose 0.05 -0.33
## 5 SSO2 YMR183C Glucose 0.05 0.05
## 6 PSP2 YML017W Glucose 0.05 -0.69
## 7 RIB2 YOL066C Glucose 0.05 -0.55
## 8 VMA13 YPR036W Glucose 0.05 -0.75
## 9 EDC3 YEL015W Glucose 0.05 -0.24
## 10 VPS5 YOR069W Glucose 0.05 -0.16
## # ... with 198,420 more rows, and 2 more variables: bp <chr>, mf <chr>
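As a compact recap of select(), here is a sketch on a toy tibble that borrows a few column names from ydat. The values and the object names toy, ids and no_go are invented for illustration.

```r
library(dplyr)

toy <- tibble(
  symbol          = c("SFB2", "QRI7"),
  systematic_name = c("YNL049C", "YDL104C"),
  expression      = c(-0.24, -0.02),
  bp              = c("process A", "process B"),
  mf              = c("function A", "function B")
)

# Keep only the named columns
ids <- select(toy, symbol, systematic_name)

# Drop columns by prefixing their names with a minus sign
no_go <- select(toy, -bp, -mf)

names(ids)
names(no_go)
names(toy)   # the original still has all five columns
```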
mutate()
The mutate() function adds new columns to the data. Remember, it doesn’t
actually modify the data frame you’re operating on, and the result is
transient unless you assign it to a new object or reassign it back to itself
(which is not always a good practice).
The expression level reported here is the log2 of the sample signal divided
by the signal in the reference channel, where the reference RNA for all
samples was taken from the glucose-limited chemostat grown at a dilution
rate of 0.25 h⁻¹. Let’s mutate this data to add a new variable called “signal”
that’s the actual raw signal ratio instead of the log-transformed signal.
mutate(nogo, signal=2^expression)
Mutate has a nice little feature too in that it’s “lazy.” You can mutate and add
one variable, then continue mutating to add more variables based on that
variable. Let’s make another column that’s the square root of the signal
ratio.
Again, don’t worry about the code here to make the plot – we’ll learn about
this later. Why do you think we log-transform the data prior to analysis?
library(tidyr)
mutate(nogo, signal=2^expression, sigsr=sqrt(signal)) %>%
  gather(unit, value, expression:sigsr) %>%
  ggplot(aes(value)) + geom_histogram(bins=100) + facet_wrap(~unit, scales="free")
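To make the arithmetic behind mutate() concrete, here is a minimal sketch on three invented log2 expression values. The object names toy and out are illustrative. It also shows that a column created earlier in the same mutate() call (signal) can be used to build a later one (sigsr).

```r
library(dplyr)

# Invented log2 ratios: -1, 0 and 2 correspond to ratios of 0.5, 1 and 4
toy <- tibble(
  symbol     = c("A", "B", "C"),
  expression = c(-1, 0, 2)
)

# signal undoes the log2 transform; sigsr is built from the new signal column
out <- mutate(toy, signal = 2^expression, sigsr = sqrt(signal))
out

# toy is unchanged: it still has only two columns
names(toy)
```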
arrange()
The arrange() function does what it sounds like. It takes a data frame or tbl
and arranges (or sorts) by column(s) of interest. The first argument is the
data, and subsequent arguments are columns to sort on. Use the desc()
function to arrange in descending order.
## # A tibble: 198,430 x 7
##    symbol systematic_name nutrient  rate expression bp
##    <chr>  <chr>           <chr>    <dbl>      <dbl> <chr>
##  1 AAC1   YMR056C         Glucose   0.05       1.50 aerobic respiration*
##  2 AAC1   YMR056C         Glucose   0.10       1.54 aerobic respiration*
##  3 AAC1   YMR056C         Glucose   0.15       1.16 aerobic respiration*
##  4 AAC1   YMR056C         Glucose   0.20       1.04 aerobic respiration*
##  5 AAC1   YMR056C         Glucose   0.25       0.84 aerobic respiration*
##  6 AAC1   YMR056C         Glucose   0.30       0.01 aerobic respiration*
##  7 AAC1   YMR056C         Ammonia   0.05       0.80 aerobic respiration*
##  8 AAC1   YMR056C         Ammonia   0.10       1.47 aerobic respiration*
##  9 AAC1   YMR056C         Ammonia   0.15       0.97 aerobic respiration*
## 10 AAC1   YMR056C         Ammonia   0.20       0.76 aerobic respiration*
## # ... with 198,420 more rows, and 1 more variables: mf <chr>
## # A tibble: 198,430 x 7
## symbol systematic_name nutrient rate expression
## <chr> <chr> <chr> <dbl> <dbl>
## 1 GAP1 YKR039W Ammonia 0.05 6.64
## 2 DAL5 YJR152W Ammonia 0.05 6.64
## 3 GAP1 YKR039W Ammonia 0.10 6.64
## 4 DAL5 YJR152W Ammonia 0.10 6.64
## 5 DAL5 YJR152W Ammonia 0.15 6.64
## 6 DAL5 YJR152W Ammonia 0.20 6.64
## 7 DAL5 YJR152W Ammonia 0.25 6.64
## 8 DAL5 YJR152W Ammonia 0.30 6.64
## 9 GIT1 YCR098C Phosphate 0.05 6.64
## 10 PHM6 YDR281C Phosphate 0.05 6.64
## # ... with 198,420 more rows, and 2 more variables: bp <chr>, mf <chr>
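The sorting behavior described for arrange() can be sketched on a toy tibble. The values and the names toy, by_symbol, by_expr and by_both are invented for illustration.

```r
library(dplyr)

toy <- tibble(
  symbol     = c("LEU1", "ADH2", "AAC1", "ADH2"),
  expression = c(0.87, 6.28, 1.50, -4.60)
)

# Sort ascending by gene symbol
by_symbol <- arrange(toy, symbol)

# Sort descending by expression
by_expr <- arrange(toy, desc(expression))

# Sort by symbol first, then by descending expression within each symbol
by_both <- arrange(toy, symbol, desc(expression))

by_symbol
by_expr
by_both
```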
EXERCISE 2
1. First, re-run the command you used above to filter the data for genes
involved in the “leucine biosynthesis” biological process and where the
limiting nutrient is Leucine.
2. Wrap this entire filtered result with a call to arrange() where you’ll
arrange the result of #1 by the gene symbol.
3. Wrap this entire result in a View() statement so you can see the entire
result.
summarize()
The summarize() function summarizes multiple values to a single value. On
its own the summarize() function doesn’t seem to be all that useful. The
dplyr package provides a few convenience functions called n() and
n_distinct() that tell you the number of observations or the number of
distinct values of a particular variable.
Notice that summarize takes a data frame and returns a data frame. In this
case it’s a 1x1 data frame with a single row and a single column. The name
of the column, by default, is whatever expression was used to summarize the
data. This usually isn’t pretty, and if we wanted to work with this
resulting data frame later on, we’d want to name that returned value
something easier to deal with.
## # A tibble: 1 x 1
## `mean(expression)`
## <dbl>
## 1 0.00337
## # A tibble: 1 x 1
## meanexp
## <dbl>
## 1 0.00337
# Measure the correlation between rate and expression
summarize(ydat, r=cor(rate, expression))
## # A tibble: 1 x 1
## r
## <dbl>
## 1 -0.022
## # A tibble: 1 x 1
## `n()`
## <int>
## 1 198430
## # A tibble: 1 x 1
## `n_distinct(symbol)`
## <int>
## 1 4211
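A short sketch of summarize() with named results, together with the n() and n_distinct() helpers, on an invented toy tibble (the names toy and out are illustrative):

```r
library(dplyr)

toy <- tibble(
  symbol     = c("A", "A", "B", "C"),
  expression = c(1, 2, 3, 6)
)

# Naming each summary gives readable column names in the one-row result
out <- summarize(
  toy,
  meanexp = mean(expression),   # average of the four values
  nobs    = n(),                # number of rows
  ngenes  = n_distinct(symbol)  # number of different symbols
)
out
```

Naming the summaries up front avoids column names like `mean(expression)` that need backticks later on.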
group_by()
We saw that summarize() isn’t that useful on its own. Neither is group_by().
All this does is take an existing data frame and convert it into a grouped
data frame where operations are performed by group.
ydat
## # A tibble: 198,430 x 7
## symbol systematic_name nutrient rate expression
## <chr> <chr> <chr> <dbl> <dbl>
## 1 SFB2 YNL049C Glucose 0.05 -0.24
## 2 <NA> YNL095C Glucose 0.05 0.28
## 3 QRI7 YDL104C Glucose 0.05 -0.02
## 4 CFT2 YLR115W Glucose 0.05 -0.33
## 5 SSO2 YMR183C Glucose 0.05 0.05
## 6 PSP2 YML017W Glucose 0.05 -0.69
## 7 RIB2 YOL066C Glucose 0.05 -0.55
## 8 VMA13 YPR036W Glucose 0.05 -0.75
## 9 EDC3 YEL015W Glucose 0.05 -0.24
## 10 VPS5 YOR069W Glucose 0.05 -0.16
## # ... with 198,420 more rows, and 2 more variables: bp <chr>, mf <chr>
group_by(ydat, nutrient)
## # A tibble: 198,430 x 7
## # Groups: nutrient [6]
## symbol systematic_name nutrient rate expression
## <chr> <chr> <chr> <dbl> <dbl>
## 1 SFB2 YNL049C Glucose 0.05 -0.24
## 2 <NA> YNL095C Glucose 0.05 0.28
## 3 QRI7 YDL104C Glucose 0.05 -0.02
## 4 CFT2 YLR115W Glucose 0.05 -0.33
## 5 SSO2 YMR183C Glucose 0.05 0.05
## 6 PSP2 YML017W Glucose 0.05 -0.69
## 7 RIB2 YOL066C Glucose 0.05 -0.55
## 8 VMA13 YPR036W Glucose 0.05 -0.75
## 9 EDC3 YEL015W Glucose 0.05 -0.24
## 10 VPS5 YOR069W Glucose 0.05 -0.16
## # ... with 198,420 more rows, and 2 more variables: bp <chr>, mf <chr>
The real power comes in where group_by() and summarize() are used
together. First, write the group_by() statement. Then wrap the result of that
with a call to summarize() .
## # A tibble: 4,211 x 2
## symbol meanexp
## <chr> <dbl>
## 1 AAC1 0.52889
## 2 AAC3 -0.21629
## 3 AAD10 0.43833
## 4 AAD14 -0.07167
## 5 AAD16 0.24194
## 6 AAD4 -0.79167
## 7 AAD6 0.29028
## 8 AAH1 0.04611
## 9 AAP1 -0.00361
## 10 AAP1' -0.42139
## # ... with 4,201 more rows
# Get the correlation between rate and expression for each nutrient
# group_by(ydat, nutrient)
summarize(group_by(ydat, nutrient), r=cor(rate, expression))
## # A tibble: 6 x 2
## nutrient r
## <chr> <dbl>
## 1 Ammonia -0.0175
## 2 Glucose -0.0112
## 3 Leucine -0.0384
## 4 Phosphate -0.0194
## 5 Sulfate -0.0166
## 6 Uracil -0.0353
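The group_by() then summarize() pattern can be sketched on a toy tibble where the per-group answers are easy to check by hand. The values and the names toy, grouped and out are invented for illustration.

```r
library(dplyr)

toy <- tibble(
  nutrient   = c("Glucose", "Glucose", "Leucine", "Leucine"),
  expression = c(1, 3, -2, 4)
)

# Step 1: mark the grouping (the data itself is not changed)
grouped <- group_by(toy, nutrient)

# Step 2: summarize() now returns one row per group
out <- summarize(grouped, meanexp = mean(expression), nobs = n())
out
```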
The pipe: %>%
Here’s the simplest way to use the pipe operator, %>% . Remember the tail()
function. It expects a data frame as input, and the next argument is the
number of lines to print. These two commands are identical:
tail(ydat, 5)
## # A tibble: 5 x 7
## symbol systematic_name nutrient rate expression
## <chr> <chr> <chr> <dbl> <dbl>
## 1 KRE1 YNL322C Uracil 0.3 0.28
## 2 MTL1 YGR023W Uracil 0.3 0.27
## 3 KRE9 YJL174W Uracil 0.3 0.43
## 4 UTH1 YKR042W Uracil 0.3 0.19
## 5 <NA> YOL111C Uracil 0.3 0.04
## # ... with 2 more variables: bp <chr>, mf <chr>
ydat %>% tail(5)
## # A tibble: 5 x 7
## symbol systematic_name nutrient rate expression
## <chr> <chr> <chr> <dbl> <dbl>
## 1 KRE1 YNL322C Uracil 0.3 0.28
## 2 MTL1 YGR023W Uracil 0.3 0.27
## 3 KRE9 YJL174W Uracil 0.3 0.43
## 4 UTH1 YKR042W Uracil 0.3 0.19
## 5 <NA> YOL111C Uracil 0.3 0.04
## # ... with 2 more variables: bp <chr>, mf <chr>
filter(ydat, nutrient=="Leucine")
## # A tibble: 33,178 x 7
## symbol systematic_name nutrient rate expression
## <chr> <chr> <chr> <dbl> <dbl>
## 1 SFB2 YNL049C Leucine 0.05 0.18
## 2 <NA> YNL095C Leucine 0.05 0.16
## 3 QRI7 YDL104C Leucine 0.05 -0.30
## 4 CFT2 YLR115W Leucine 0.05 -0.27
## 5 SSO2 YMR183C Leucine 0.05 -0.59
## 6 PSP2 YML017W Leucine 0.05 -0.17
## 7 RIB2 YOL066C Leucine 0.05 -0.02
## 8 VMA13 YPR036W Leucine 0.05 -0.11
## 9 EDC3 YEL015W Leucine 0.05 0.12
## 10 VPS5 YOR069W Leucine 0.05 -0.20
## # ... with 33,168 more rows, and 2 more variables: bp <chr>, mf <chr>
ydat %>% filter(nutrient=="Leucine")
## # A tibble: 33,178 x 7
## symbol systematic_name nutrient rate expression
## <chr> <chr> <chr> <dbl> <dbl>
## 1 SFB2 YNL049C Leucine 0.05 0.18
## 2 <NA> YNL095C Leucine 0.05 0.16
## 3 QRI7 YDL104C Leucine 0.05 -0.30
## 4 CFT2 YLR115W Leucine 0.05 -0.27
## 5 SSO2 YMR183C Leucine 0.05 -0.59
## 6 PSP2 YML017W Leucine 0.05 -0.17
## 7 RIB2 YOL066C Leucine 0.05 -0.02
## 8 VMA13 YPR036W Leucine 0.05 -0.11
## 9 EDC3 YEL015W Leucine 0.05 0.12
## 10 VPS5 YOR069W Leucine 0.05 -0.20
## # ... with 33,168 more rows, and 2 more variables: bp <chr>, mf <chr>
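Take the ydat dataset, then
filter() for genes in the “leucine biosynthesis” biological process, then
group_by() the limiting nutrient, then
summarize() to get the correlation (r) between rate and expression, then
mutate() to round the result of the above calculation to two decimal places, then
arrange() by the rounded correlation coefficient.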
mutate(summarize(group_by(filter(ydat, bp == "leucine biosynthesis"),
nutrient),
r = cor(rate, expression)), r = round(r, 2))
arrange(
mutate(
summarize(
group_by(
filter(ydat, bp=="leucine biosynthesis"),
nutrient),
r=cor(rate, expression)),
r=round(r, 2)),
r)
## # A tibble: 6 x 2
## nutrient r
## <chr> <dbl>
## 1 Leucine -0.58
## 2 Glucose -0.04
## 3 Ammonia 0.16
## 4 Sulfate 0.33
## 5 Phosphate 0.44
## 6 Uracil 0.58
Now compare that with the mental process of what you’re actually trying to
accomplish. The way you would do this without pipes is completely inside-
out and backwards from the way you express in words and in thought what
you want to do. The pipe operator %>% allows you to pass the output data
frame from one function as the input data frame to another function.
Nesting functions versus piping
This is how we would do that in code. It’s as simple as replacing the word
“then” in words with the symbol %>% in code. (There’s a keyboard shortcut that
I’ll use frequently to insert the %>% sequence – you can see what it is by
clicking the Tools menu in RStudio, then selecting Keyboard Shortcut Help. On
Mac, it’s CMD-SHIFT-M.)
ydat %>%
filter(bp=="leucine biosynthesis") %>%
group_by(nutrient) %>%
summarize(r=cor(rate, expression)) %>%
mutate(r=round(r,2)) %>%
arrange(r)
## # A tibble: 6 x 2
## nutrient r
## <chr> <dbl>
## 1 Leucine -0.58
## 2 Glucose -0.04
## 3 Ammonia 0.16
## 4 Sulfate 0.33
## 5 Phosphate 0.44
## 6 Uracil 0.58
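To check that nesting and piping really are two spellings of the same computation, here is a sketch on a small invented tibble. The names toy, nested and piped are illustrative, and the toy values are unrelated to the real dataset.

```r
library(dplyr)

toy <- tibble(
  nutrient   = c("Glucose", "Glucose", "Glucose", "Leucine", "Leucine", "Leucine"),
  rate       = c(0.05, 0.15, 0.30, 0.05, 0.15, 0.30),
  expression = c(0.1, 0.2, 0.3, 3.0, 2.0, 1.0)
)

# x %>% f(y) is another way of writing f(x, y)
identical(tail(toy, 2), toy %>% tail(2))

# Nested: read from the inside out
nested <- arrange(
  summarize(group_by(toy, nutrient), r = cor(rate, expression)),
  r
)

# Piped: read from top to bottom
piped <- toy %>%
  group_by(nutrient) %>%
  summarize(r = cor(rate, expression)) %>%
  arrange(r)

identical(nested, piped)
```

Both forms call the same functions with the same arguments, so the choice between them is about readability rather than results.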
Piping exercises
EXERCISE 3
Show the limiting nutrient and expression values for the gene ADH2 when
the growth rate is restricted to 0.05. Hint: 2 pipes: filter and select .
## # A tibble: 6 x 2
## nutrient expression
## <chr> <dbl>
## 1 Glucose 6.28
## 2 Ammonia 0.55
## 3 Phosphate -4.60
## 4 Sulfate -1.18
## 5 Leucine 4.15
## 6 Uracil 0.63
What are the four most highly expressed genes when the growth rate is
restricted to 0.05 by restricting glucose? Show only the symbol, expression
value, and GO terms. Hint: 4 pipes: filter , arrange , head , and select .
## # A tibble: 4 x 4
##   symbol expression bp                  mf
##   <chr>       <dbl> <chr>               <chr>
## 1 ADH2         6.28 fermentation*       alcohol dehydrogenase activity
## 2 HSP26        5.86 response to stress* unfolded protein binding
## 3 MLS1         5.64 glyoxylate cycle    malate synthase activity
## 4 HXT5         5.56 hexose transport    glucose transporter activity*
When the growth rate is restricted to 0.05, what is the average expression
level across all genes in the “response to stress” biological process,
separately for each limiting nutrient? What about genes in the “protein
biosynthesis” biological process? Hint: 3 pipes: filter , group_by ,
summarize .
## # A tibble: 6 x 2
## nutrient meanexp
## <chr> <dbl>
## 1 Ammonia 0.943
## 2 Glucose 0.743
## 3 Leucine 0.811
## 4 Phosphate 0.981
## 5 Sulfate 0.743
## 6 Uracil 0.731
## # A tibble: 6 x 2
## nutrient meanexp
## <chr> <dbl>
## 1 Ammonia -1.613
## 2 Glucose -0.691
## 3 Leucine -0.574
## 4 Phosphate -0.750
## 5 Sulfate -0.913
## 6 Uracil -0.880
EXERCISE 4
## # A tibble: 1 x 1
## `n_distinct(mf)`
## <int>
## 1 1086
## # A tibble: 10 x 2
## mf n
## <chr> <int>
## 1 molecular function unknown 886
## 2 structural constituent of ribosome 185
## 3 protein binding 107
## 4 RNA binding 63
## 5 protein binding* 53
## 6 DNA binding* 44
## 7 structural molecule activity 43
## 8 GTPase activity 40
## 9 structural constituent of cytoskeleton 39
## 10 transcription factor activity 38
How many distinct genes are there where we know what process the gene is
involved in but we don’t know what it does? Hint: 3 pipes; filter where
bp!="biological process unknown" & mf=="molecular function unknown" ,
and after select ing columns of interest, pipe the output to distinct() . The
answer should be 737, and here are a few:
## # A tibble: 737 x 3
##    symbol bp
##    <chr>  <chr>
##  1 SFB2   ER to Golgi transport
##  2 EDC3   deadenylylation-independent decapping
##  3 PER1   response to unfolded protein*
##  4 PEX25  peroxisome organization and biogenesis*
##  5 BNI5   cytokinesis*
##  6 CSN12  adaptation to pheromone during conjugation with cellular fusion
##  7 SEC39  secretory pathway
##  8 ABC1   ubiquinone biosynthesis
##  9 PRP46  nuclear mRNA splicing, via spliceosome
## 10 MAM3   mitochondrion organization and biogenesis*
## # ... with 727 more rows, and 1 more variables: mf <chr>
## # A tibble: 5,257 x 3
## # Groups: nutrient [6]
## nutrient bp meanexp
## <chr> <chr> <dbl>
## 1 Ammonia allantoate transport 6.64
## 2 Ammonia amino acid transport* 6.64
## 3 Phosphate glycerophosphodiester transport 6.64
## 4 Glucose fermentation* 6.28
## 5 Ammonia allantoin transport 5.56
## 6 Glucose glyoxylate cycle 5.29
## 7 Ammonia proline catabolism* 5.14
## 8 Ammonia urea transport 5.14
## 9 Glucose oxygen and reactive oxygen species metabolism 5.04
## 10 Glucose fumarate transport* 5.03
## # ... with 5,247 more rows
Let’s try to further process that result to get only the top three most
upregulated biological processes for each limiting nutrient. Google search
“dplyr first result within group.” You’ll need a filter(row_number()......) in
there somewhere. Hint: 5 pipes: filter , group_by , summarize , arrange ,
filter(row_number()... . Note: dplyr’s pipe syntax used to be %.% before it
changed to %>% . So when looking around, you might still see some people
use the old syntax. Now if you try to use the old syntax, you’ll get a
deprecation warning.
## # A tibble: 18 x 3
## # Groups: nutrient [6]
## nutrient bp meanexp
## <chr> <chr> <dbl>
## 1 Ammonia allantoate transport 6.64
## 2 Ammonia amino acid transport* 6.64
## 3 Phosphate glycerophosphodiester transport 6.64
## 4 Glucose fermentation* 6.28
## 5 Ammonia allantoin transport 5.56
## 6 Glucose glyoxylate cycle 5.29
## 7 Glucose oxygen and reactive oxygen species metabolism 5.04
## 8 Uracil fumarate transport* 4.32
## 9 Phosphate vacuole fusion, non-autophagic 4.20
## 10 Leucine fermentation* 4.15
## 11 Phosphate regulation of cell redox homeostasis* 4.03
## 12 Leucine fumarate transport* 3.72
## 13 Leucine glyoxylate cycle 3.65
## 14 Sulfate protein ubiquitination 3.40
## 15 Sulfate fumarate transport* 3.27
## 16 Uracil pyridoxine metabolism 3.11
## 17 Uracil asparagine catabolism* 3.06
## 18 Sulfate sulfur amino acid metabolism* 2.69
There’s a slight problem with the examples above. We’re getting the average
expression of all the biological processes separately by each nutrient. But
some of these biological processes only have a single gene in them! If we
tried to do the same thing to get the correlation between rate and
expression, the calculation would work, but we’d get a warning about a
standard deviation being zero. The correlation coefficient value that results
is NA , i.e., missing. While we’re summarizing the correlation between rate
and expression, let’s also show the number of distinct genes within each
grouping.
ydat %>%
group_by(nutrient, bp) %>%
summarize(r=cor(rate, expression), ngenes=n_distinct(symbol))
## # A tibble: 5,286 x 4
## # Groups: nutrient [?]
## nutrient bp r ngenes
## <chr> <chr> <dbl> <int>
## 1 Ammonia 'de novo' IMP biosynthesis* 0.3125 8
## 2 Ammonia 'de novo' pyrimidine base biosynthesis -0.0482 3
## 3 Ammonia 'de novo' pyrimidine base biosynthesis* 0.1670 4
## 4 Ammonia 35S primary transcript processing 0.5080 13
## 5 Ammonia 35S primary transcript processing* 0.4240 30
## 6 Ammonia acetate biosynthesis 0.4677 1
## 7 Ammonia acetate metabolism 0.9291 1
## 8 Ammonia acetate metabolism* -0.6855 1
## 9 Ammonia acetyl-CoA biosynthesis -0.8512 1
## 10 Ammonia acetyl-CoA biosynthesis from pyruvate 0.0951 1
## # ... with 5,276 more rows
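As a hedged illustration of when cor() yields NA, the sketch below uses an invented tibble with one group whose expression values are all identical (standard deviation zero) and one group where they vary. The names toy and out are hypothetical. Expect a warning about the standard deviation being zero when it runs.

```r
library(dplyr)

toy <- tibble(
  bp         = c("steady", "steady", "steady", "varying", "varying", "varying"),
  symbol     = c("G1", "G1", "G1", "G2", "G2", "G3"),
  rate       = c(0.05, 0.15, 0.30, 0.05, 0.15, 0.30),
  expression = c(0.5, 0.5, 0.5, 1, 2, 4)
)

# cor() cannot be computed for the "steady" group, so r is NA there;
# ngenes reports how many distinct symbols each group contains
out <- toy %>%
  group_by(bp) %>%
  summarize(r = cor(rate, expression), ngenes = n_distinct(symbol))
out
```

Reporting ngenes next to r makes it easy to filter out groups that are too small to interpret.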
Take the above code and continue to process the result to show only results
where the process has at least 5 genes. Add a column corresponding to the
absolute value of the correlation coefficient, and show for each nutrient the
single process with the highest correlation between rate and expression,
regardless of direction. Hint: 4 more pipes: filter , mutate , arrange , and
filter again with row_number()==1 . Ignore the warning.
## # A tibble: 6 x 5
## # Groups:   nutrient [6]
##   nutrient  bp                                              r ngenes  absr
##   <chr>     <chr>                                       <dbl>  <int> <dbl>
## 1 Glucose   telomerase-independent telomere maintenance -0.95      7  0.95
## 2 Ammonia   telomerase-independent telomere maintenance -0.91      7  0.91
## 3 Leucine   telomerase-independent telomere maintenance -0.90      7  0.90
## 4 Phosphate telomerase-independent telomere maintenance -0.90      7  0.90
## 5 Uracil    telomerase-independent telomere maintenance -0.81      7  0.81
## 6 Sulfate   translational elongation*                    0.79      5  0.79
Homework
Looking for more practice? Try this homework assignment
(r-dplyr-homework.html).
This work is licensed under the CC BY-NC-SA 4.0 Creative Commons License
(https://creativecommons.org/licenses/by-nc-sa/4.0/).
For more information, visit data.hsl.virginia.edu (http://data.hsl.virginia.edu/).