R Code Snippets
# Here's the new bit, but using the same approach we've been using this whole time.
top_countries <- filter(pack_sum, countries > 60)
result1 <- arrange(top_countries, desc(countries), avg_bytes)
# Chaining
result3 <- cran %>%
  group_by(package) %>%
  summarize(count = n(),
            unique = n_distinct(ip_id),
            countries = n_distinct(country),
            avg_bytes = mean(size)) %>%
  filter(countries > 60) %>%
  arrange(desc(countries), avg_bytes)
print(result3)
# More Chaining
cran %>%
  select(ip_id, country, package, size) %>%
  mutate(size_mb = size / 2^20) %>%
  filter(size_mb <= 0.5) %>%
  arrange(desc(size_mb))
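To see that chaining only changes how the steps are written, not what they compute, here is a minimal sketch on a small made-up table. The data frame cran_toy and its values are hypothetical stand-ins for cran, and the filter threshold is lowered to 2 so that the toy rows survive.

```r
library(dplyr)

# Hypothetical stand-in for the cran download log (all values are made up).
cran_toy <- data.frame(
  ip_id   = c(1, 1, 2, 3, 3, 4),
  country = c("US", "US", "DE", "FR", "FR", "JP"),
  package = c("dplyr", "tidyr", "dplyr", "dplyr", "tidyr", "tidyr"),
  size    = c(1000, 2000, 3000, 5000, 4000, 6000),
  stringsAsFactors = FALSE
)

# Step-by-step version: every intermediate result gets its own name.
by_package    <- group_by(cran_toy, package)
pack_sum      <- summarize(by_package,
                           count     = n(),
                           unique    = n_distinct(ip_id),
                           countries = n_distinct(country),
                           avg_bytes = mean(size))
top_countries <- filter(pack_sum, countries > 2)
stepwise      <- arrange(top_countries, desc(countries), avg_bytes)

# Chained version: the same four steps, read top to bottom.
chained <- cran_toy %>%
  group_by(package) %>%
  summarize(count     = n(),
            unique    = n_distinct(ip_id),
            countries = n_distinct(country),
            avg_bytes = mean(size)) %>%
  filter(countries > 2) %>%
  arrange(desc(countries), avg_bytes)

# Both approaches should give the same table.
same_result <- isTRUE(all.equal(as.data.frame(stepwise), as.data.frame(chained)))
print(same_result)
```

The chained form avoids naming intermediate objects such as by_package and pack_sum, which is the main practical difference between the two styles.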
Any dataset that doesn't satisfy these conditions is considered 'messy'
data.
> students
  grade male female
1     A    1      5
2     B    5      0
3     C    5      2
4     D    5      5
5     E    7      4
This dataset actually has three variables: grade, sex, and count. The
first variable, grade, is already a column, so that should remain as it
is. The second variable, sex, is captured by the second and third
column headings. The third variable, count, is the number of students
for each combination of grade and sex.
> gather(students, sex, count, -grade)
   grade    sex count
1      A   male     1
2      B   male     5
3      C   male     5
4      D   male     5
5      E   male     7
6      A female     5
7      B female     0
8      C female     2
9      D female     5
10     E female     4
It's important to understand what each argument to gather() means. The
data argument, students, gives the name of the original dataset. The
key and value arguments -- sex and count, respectively -- give the
column names for our tidy dataset. The final argument, -grade, says
that we want to gather all columns EXCEPT the grade column (since grade
is already a proper column variable).
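As a minimal, self-contained sketch of the call described above, the snippet below rebuilds students from the values printed earlier and gathers it. The name tidy_students is introduced here only for illustration.

```r
library(tidyr)

# Rebuild the students table from the printed values.
students <- data.frame(
  grade  = c("A", "B", "C", "D", "E"),
  male   = c(1, 5, 5, 5, 7),
  female = c(5, 0, 2, 5, 4),
  stringsAsFactors = FALSE
)

# data = students, key = sex, value = count; -grade excludes grade from gathering.
tidy_students <- gather(students, sex, count, -grade)
print(tidy_students)
```

Recent tidyr releases mark gather() as superseded by pivot_longer(), so new code may prefer that function; gather() is used here because it is the one this lesson teaches.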
The second messy data case we'll look at is when multiple variables are
stored in one column. Type students2 to see an example of this.
> students2
  grade male_1 female_1 male_2 female_2
1     A      3        4      3        4
2     B      6        4      3        5
3     C      7        4      3        8
4     D      4        0      8        1
5     E      1        1      2        7
This dataset is similar to the first, except now there are two separate
classes, 1 and 2, and we have total counts for each sex within each
class. students2 suffers from the same messy data problem of having
column headers that are values (male_1, female_1, etc.) and not
variable names (sex, class, and count).
However, it also has multiple variables stored in each column (sex and
class), which is another common symptom of messy data. Tidying this
dataset will be a two-step process.
Let's start by using gather() to stack the columns of students2, like
we just did with students. This time, name the 'key' column sex_class
and the 'value' column count. Save the result to a new variable called
res. Consult ?gather again if you need help.
> res <- gather(students2, sex_class, count, -grade)
> res
   grade sex_class count
1      A    male_1     3
2      B    male_1     6
3      C    male_1     7
4      D    male_1     4
5      E    male_1     1
6      A  female_1     4
7      B  female_1     4
8      C  female_1     4
9      D  female_1     0
10     E  female_1     1
11     A    male_2     3
12     B    male_2     3
13     C    male_2     3
14     D    male_2     8
15     E    male_2     2
16     A  female_2     4
17     B  female_2     5
18     C  female_2     8
19     D  female_2     1
20     E  female_2     7
That got us halfway to tidy data, but we still have two different
variables, sex and class, stored together in the sex_class column.
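One way to finish the second step is tidyr's separate(), which splits one column into several. The sketch below rebuilds students2 from the printed values and assumes the default separator behaviour (splitting on non-alphanumeric characters such as the underscore). The name tidy_students2 is introduced here only for illustration.

```r
library(tidyr)

# Rebuild students2 from the printed values.
students2 <- data.frame(
  grade    = c("A", "B", "C", "D", "E"),
  male_1   = c(3, 6, 7, 4, 1),
  female_1 = c(4, 4, 4, 0, 1),
  male_2   = c(3, 3, 3, 8, 2),
  female_2 = c(4, 5, 8, 1, 7),
  stringsAsFactors = FALSE
)

# Step 1: stack the four count columns into key/value pairs.
res <- gather(students2, sex_class, count, -grade)

# Step 2: split sex_class (e.g. "male_1") into separate sex and class columns.
tidy_students2 <- separate(res, sex_class, into = c("sex", "class"))
print(tidy_students2)
```

Note that class comes out as a character column ("1", "2") in this sketch; convert it with as.integer() if a numeric class is needed.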
A third symptom of messy data is when variables are stored in both rows
and columns. students3 provides an example of this. Print students3 to
the console.
> students3
    name    test class1 class2 class3 class4 class5
1  Sally midterm      A   <NA>      B   <NA>   <NA>
2  Sally   final      C   <NA>      C   <NA>   <NA>
3   Jeff midterm   <NA>      D   <NA>      A   <NA>
4   Jeff   final   <NA>      E   <NA>      C   <NA>
5  Roger midterm   <NA>      C   <NA>   <NA>      B
6  Roger   final   <NA>      A   <NA>   <NA>      A
7  Karen midterm   <NA>   <NA>      C      A   <NA>
8  Karen   final   <NA>   <NA>      C      A   <NA>
9  Brian midterm      B   <NA>   <NA>   <NA>      A
10 Brian   final      B   <NA>   <NA>   <NA>      C
In students3, we have midterm and final exam grades for five students,
each of whom was enrolled in exactly two of five possible classes.
The first variable, name, is already a column and should remain as it
is. The headers of the last five columns, class1 through class5, are
all different values of what should be a class variable. The values in
the test column, midterm and final, should each be its own variable
containing the respective grades for each student.
students3 %>%
  gather(class, grade, class1:class5, na.rm = TRUE) %>%
  print
    name    test  class grade
1  Sally midterm class1     A
2  Sally   final class1     C
3  Brian midterm class1     B
4  Brian   final class1     B
5   Jeff midterm class2     D
6   Jeff   final class2     E
7  Roger midterm class2     C
8  Roger   final class2     A
9  Sally midterm class3     B
10 Sally   final class3     C
11 Karen midterm class3     C
12 Karen   final class3     C
13  Jeff midterm class4     A
14  Jeff   final class4     C
15 Karen midterm class4     A
16 Karen   final class4     A
17 Roger midterm class5     B
18 Roger   final class5     A
19 Brian midterm class5     A
20 Brian   final class5     C
students3 %>%
  gather(class, grade, class1:class5, na.rm = TRUE) %>%
  spread(test, grade) %>%
  print
    name
1  Brian
2  Brian
3   Jeff
4   Jeff
5  Karen
6  Karen
7  Roger
8  Roger
9  Sally
10 Sally
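To make the gather() and spread() steps reproducible, here is a minimal sketch that rebuilds students3 from the values printed earlier and runs the same chain. The final mutate() step, which strips the "class" prefix with base R's sub(), is an assumption added for illustration and is not part of the code shown in this section. The name tidy_students3 is likewise hypothetical.

```r
library(dplyr)
library(tidyr)

# Rebuild students3 from the printed values (NA marks classes not taken).
students3 <- data.frame(
  name   = rep(c("Sally", "Jeff", "Roger", "Karen", "Brian"), each = 2),
  test   = rep(c("midterm", "final"), times = 5),
  class1 = c("A", "C", NA, NA, NA, NA, NA, NA, "B", "B"),
  class2 = c(NA, NA, "D", "E", "C", "A", NA, NA, NA, NA),
  class3 = c("B", "C", NA, NA, NA, NA, "C", "C", NA, NA),
  class4 = c(NA, NA, "A", "C", NA, NA, "A", "A", NA, NA),
  class5 = c(NA, NA, NA, NA, "B", "A", NA, NA, "A", "C"),
  stringsAsFactors = FALSE
)

tidy_students3 <- students3 %>%
  # Stack class1..class5 into class/grade pairs, dropping the NA rows.
  gather(class, grade, class1:class5, na.rm = TRUE) %>%
  # Turn the values of test (midterm, final) into their own columns.
  spread(test, grade) %>%
  # Assumed extra step: keep only the class number, e.g. "class5" -> 5.
  mutate(class = as.integer(sub("class", "", class)))

print(tidy_students3)
```

After these steps each row describes one student in one class, with the midterm and final grades side by side.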
The fourth messy data problem we'll look at occurs when multiple
observational units are stored in the same table. students4 presents an
example of this. Take a look at the data now.
> students4
    id  name sex class midterm final
1  168 Brian   F     1       B     B
2  168 Brian   F     5       A     C
3  588 Sally   M     1       A     C
4  588 Sally   M     3       B     C
5  710  Jeff   M     2       D     E
6  710  Jeff   M     4       A     C
7  731 Roger   F     2       C     A
8  731 Roger   F     5       B     A
9  908 Karen   M     3       C     C
10 908 Karen   M     4       A     A
students4 is almost the same as our tidy version of students3. The only
difference is that students4 provides a unique id for each student, as
well as his or her sex (M = male; F = female).
Our solution will be to break students4 into two separate tables -- one
containing basic student information (id, name, and sex) and the other
containing grades (id, class, midterm, final).
 id  name sex
168 Brian   F
168 Brian   F
588 Sally   M
588 Sally   M
710  Jeff   M
710  Jeff   M
731 Roger   F
731 Roger   F
908 Karen   M
908 Karen   M

 id  name sex
168 Brian   F
588 Sally   M
710  Jeff   M
731 Roger   F
908 Karen   M
Now, using the script I just opened for you, create a second table
called gradebook using the id, class, midterm, and final columns (in
that order).
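The split into two tables can be sketched as follows. The snippet rebuilds students4 from the printed values; student_info is a hypothetical name for the first table, and using unique() to drop the repeated rows is one possible choice (dplyr's distinct() would work as well).

```r
library(dplyr)

# Rebuild students4 from the printed values.
students4 <- data.frame(
  id      = rep(c(168, 588, 710, 731, 908), each = 2),
  name    = rep(c("Brian", "Sally", "Jeff", "Roger", "Karen"), each = 2),
  sex     = rep(c("F", "M", "M", "F", "M"), each = 2),
  class   = c(1, 5, 1, 3, 2, 4, 2, 5, 3, 4),
  midterm = c("B", "A", "A", "B", "D", "A", "C", "B", "C", "A"),
  final   = c("B", "C", "C", "C", "E", "C", "A", "A", "C", "A"),
  stringsAsFactors = FALSE
)

# Table 1: one row per student (duplicate rows removed).
student_info <- students4 %>%
  select(id, name, sex) %>%
  unique()

# Table 2: one row per student per class; id links back to student_info.
gradebook <- students4 %>%
  select(id, class, midterm, final)

print(student_info)
print(gradebook)
```

Keeping id in both tables is what allows them to be joined again later.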
> failed
   name class final
1 Brian     5     C
2 Sally     1     C
3 Sally     3     C
4  Jeff     2     E
5  Jeff     4     C
6 Karen     3     C
> sat
Source: local data frame [6 x 10]

  score_range read_male read_fem read_total math_male math_fem math_total write_male write_fem
1     700-800     40151    38898      79049     74461    46040     120501      31574     39101
2     600-690    121950   126084     248034    162564   133954     296518     100963    125368
3     500-590    227141   259553     486694    233141   257678     490819     202326    247239
4     400-490    242554   296793     539347    204670   288696     493366     262623    302933
5     300-390    113568   133473     247041     82468   131025     213493     146106    144381
6     200-290     30728    29154      59882     18788    26562      45350      32500     24933
Variables not shown: write_total (int)
sat %>%
  select(-contains("total")) %>%
  gather(part_sex, count, -score_range) %>%
  separate(part_sex, c("part", "sex")) %>%
  print
   score_range part  sex  count
1      700-800 read male  40151
2      600-690 read male 121950
3      500-590 read male 227141
4      400-490 read male 242554
5      300-390 read male 113568
6      200-290 read male  30728
7      700-800 read  fem  38898
8      600-690 read  fem 126084
9      500-590 read  fem 259553
10     400-490 read  fem 296793
..         ...  ...  ...    ...
score_range part  sex  count  total       prop
    700-800 read male  40151 776092 0.05173485
    600-690 read male 121950 776092 0.15713343
    500-590 read male 227141 776092 0.29267278
    400-490 read male 242554 776092 0.31253253
    300-390 read male 113568 776092 0.14633317
    200-290 read male  30728 776092 0.03959324
    700-800 read  fem  38898 883955 0.04400450
    600-690 read  fem 126084 883955 0.14263622
    500-590 read  fem 259553 883955 0.29362694
    400-490 read  fem 296793 883955 0.33575578
        ...  ...  ...    ...    ...        ...
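The total and prop columns are consistent with grouping by part and sex, summing count within each group, and dividing. The sketch below is one way to compute them and is an assumption, since the command that produced this output is not shown here. To stay short it uses only the two reading columns of sat; sat_read and sat_prop are hypothetical names.

```r
library(dplyr)
library(tidyr)

# Only the reading columns of sat, copied from the printed values.
sat_read <- data.frame(
  score_range = c("700-800", "600-690", "500-590", "400-490", "300-390", "200-290"),
  read_male   = c(40151, 121950, 227141, 242554, 113568, 30728),
  read_fem    = c(38898, 126084, 259553, 296793, 133473, 29154),
  stringsAsFactors = FALSE
)

sat_prop <- sat_read %>%
  # Stack the count columns, then split "read_male" into part and sex.
  gather(part_sex, count, -score_range) %>%
  separate(part_sex, c("part", "sex")) %>%
  # Within each part/sex group, compute the group total and each row's share.
  group_by(part, sex) %>%
  mutate(total = sum(count),
         prop  = count / total) %>%
  ungroup()

print(sat_prop)
```

Because mutate() on grouped data keeps every row, total is repeated within each group, which matches the repeated 776092 and 883955 values in the output.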