0% found this document useful (0 votes)
4 views

Chapter 3 _STAT1204..

The document discusses group manipulation in R, focusing on methods for summarizing and analyzing data subsets using functions like apply, lapply, sapply, tapply, and mapply. It also covers the aggregate function from the plyr package for grouping and summarizing data, and introduces data reshaping techniques using pivot_longer and pivot_wider from the tidyr package. Practical exercises are provided to reinforce the concepts using the iris and mtcars datasets.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
4 views

Chapter 3 _STAT1204..

The document discusses group manipulation in R, focusing on methods for summarizing and analyzing data subsets using functions like apply, lapply, sapply, tapply, and mapply. It also covers the aggregate function from the plyr package for grouping and summarizing data, and introduces data reshaping techniques using pivot_longer and pivot_wider from the tidyr package. Practical exercises are provided to reinforce the concepts using the iris and mtcars datasets.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 10

University of Tabuk – Faculty of Science – Dept. of Stat.

Statistical Computing– STAT 1204

Statistical Computing: Chapter 3: Group Manipulation

Basim Alsaedi&Dr. Dalia Alnagar

2024-12-09

Group manipulation in R refers to the process of grouping data based on


certain categories and then performing operations based on each group
separately. This is useful when you want to summarize, analyze or transform
subsets of your data independently. In simple terms, group manipulation involves
splitting the data into groups, applying a function to each group, and then
combining the results. We will explore different methods designed by researchers
for group manipulation. They are group manipulation using;
 The apply family,
 The aggregate from plyr package,
 Data reshaping
3.1 Apply Family
The apply family in R is a collection of functions that helps you apply operations
to data structures like vectors, lists, matrices and data frames in a more efficient
way than using loops. Think of these functions as a way to give commands to your
data in bulk, telling each piece what to do without repeating yourself.
We will have a quick overview of the members of the apply family;
 apply() - Works with matrices or data frames, applying a function to rows or
columns.
 lapply() - Loops over elements in a list, applying a function to each element
and returning a list.
 sapply() - Similar to lapply, but it returns a vector or matrix when possible.
 tapply() - Applies a function over subsets of data, especially useful for
factors or groups.
 mapply() - Applies a function to multiple arguments simultaneously.
Try it:
Here is the apply family in action using the built-in R data set that contains
information about flowers.
 Use apply to calculate the mean of each column in the iris data set at
once(No need of specifying the columns)
# Load and view the first few rows of the iris data set
data(iris)
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
University of Tabuk – Faculty of Science – Dept. of Stat. Statistical Computing– STAT 1204

## 3 4.7 3.2 1.3 0.2 setosa


## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
# Calculate the mean of each numeric column
col_means <- apply(iris[, 1:4], 2, mean)
print(col_means)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 5.843333 3.057333 3.758000 1.199333
 The 2 in apply means “apply the function to columns” and the mean was
used to find the average of each column. This is simple as asking a helper to
calculate the the average for all types of flowers for each characteristic
(sepal length, petal length, etc.).
 Let’s repeat the same for a each row, instead of argument value 2 we will
put argument value 1 in the second position.
row_means <- apply(iris[, 1:4], 1, mean) # Calculate the mean for each row
head(row_means, 15) # Show the first fifteen averages of the row
## [1] 2.550 2.375 2.350 2.350 2.550 2.850 2.425 2.525 2.225 2.400 2.700 2.500
## [13] 2.325 2.125 2.800
 Now lets use the lapply function to find the range for each numeric column.
This function applies to each element and returns a list. No need to specify if
its a column or a row
# Calculate the range of each numeric column in the iris dataset
column_ranges <- lapply(iris[, 1:4], range)
print(column_ranges)
## $Sepal.Length
## [1] 4.3 7.9
## $Sepal.Width
## [1] 2.0 4.4
## $Petal.Length
## [1] 1.0 6.9
## $Petal.Width
## [1] 0.1 2.5
Repeating the function with mean function instead of the range function.
# Calculate the mean of each numeric column in the iris dataset
col_means <- lapply(iris[, 1:4], mean)
print(col_means)
## $Sepal.Length
## [1] 5.843333
## $Sepal.Width
## [1] 3.057333
University of Tabuk – Faculty of Science – Dept. of Stat. Statistical Computing– STAT 1204

## $Petal.Length
## [1] 3.758
## $Petal.Width
## [1] 1.199333
You see! lapply function works column wise instead of row wise when working
with data frames.
Example :create a function that will add 10 to the input value and use
the lapply function to work on a vector.
# Create a vector
current_ages <- c(21, 43, 12, 56, 32)

# Create a function that adds 10 to an input value


add_10 <- function(value){
return(value + 10) }
# Test the function
add_10 <- function(value){
return(value + 10) }
add_10 (27)
## [1] 37
# Apply the function to vector ages
ages_10_years_later <- lapply(current_ages, add_10)
ages_10_years_later # Show the result
## [[1]]
## [1] 31
## [[2]]
## [1] 53
## [[3]]
## [1] 22
## [[4]]
## [1] 66
## [[5]]
## [1] 42
It returns a list with values in the vector current_ages add 10 to each value.
 The sapply() function works similarly to lapply(), but it tries to simplify the
output. If possible, it will return a vector or matrix instead of a list. Let`s
calculate the variance for each numeric column;
# Calculate the variance for each numeric column
col_variance <- sapply(iris[, 1:4], var)
print(col_variance)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 0.6856935 0.1899794 3.1162779 0.5810063
University of Tabuk – Faculty of Science – Dept. of Stat. Statistical Computing– STAT 1204

Remember that we created a function add_10 that adds 10 to the current ages of the
clients. Lets repeat the same using the sapply function instead of lapply function.
# Calculate the variance for each numeric column
ages_10_years_later <- sapply(current_ages, add_10)
print(ages_10_years_later)
## [1] 31 53 22 66 42
It is now evident that sapply has a simpler output than the lapply function.
 The tapply() function applies a function to subsets of data grouped by a
factor (e.g., species in our case). Let’s calculate the average sepal length for
each species:
# Calculate the average Sepal.Length for each Species
avg_sepal_by_species <- tapply(iris$Sepal.Length, iris$Species, mean)
print(avg_sepal_by_species)
## setosa versicolor virginica
## 5.006 5.936 6.588
 Finally the mapply() function is useful when you want to apply a function
to multiple sets of arguments at once. Let’s calculate the sum
of Sepal.Length and Sepal.Width for each row:
# Sum Sepal.Length and Sepal.Width for each row
sepal_sum <- mapply(sum, iris$Sepal.Length, iris$Sepal.Width)
head(sepal_sum)
## [1] 8.6 7.9 7.9 7.7 8.6 9.3
This function adds the sepal length and width for each flower row by row. It’s like
your helper asking every customer for two values and summing them up together.
Practical Exercise

1. Use apply() to calculate the maximum for each column in the iris data set.
Solution
max_values <- apply(iris[, 1:4], 2, max)
print(max_values)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 7.9 4.4 6.9 2.5
2. Use lapply() to find the summary statistics (use the summary() function) for
each numeric column in the iris data set.
Solution
sum_stats <- lapply(iris[,1:4], summary)
print(sum_stats)
## $Sepal.Length
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.300 5.100 5.800 5.843 6.400 7.900
## $Sepal.Width
University of Tabuk – Faculty of Science – Dept. of Stat. Statistical Computing– STAT 1204

## Min. 1st Qu. Median Mean 3rd Qu. Max.


## 2.000 2.800 3.000 3.057 3.300 4.400
## $Petal.Length
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.600 4.350 3.758 5.100 6.900
## $Petal.Width
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.100 0.300 1.300 1.199 1.800 2.500
3. Use tapply() to find the average petal width for each species in the iris data
set.
Solution
# Calculate the average Petal.Width for each Species
avg_petal_width_by_species <- tapply(iris$Petal.Width, iris$Species, mean)
print(avg_petal_width_by_species)
## setosa versicolor virginica
## 0.246 1.326 2.026
3.2 Aggregate Plyr
The aggregate() function from plyr package is a powerful tool for grouping and
summarizing data in R. This is similar to the SQL GROUP BY command or
the tapply() that we have discussed above. The difference is that aggregate() allows
to summarize data based on one or more grouping factors.
Try it!
Let’s explore an example using the built-in mtcars data set to show how to use
the aggregate() from the plyr package. The plyr package can be installed by:
install.packages("plyr")
Lets start
library(plyr)
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then
dplyr:
library(plyr); library(dplyr)
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
# Load the data set
data("mtcars")

# Use aggregate to find the average 'mpg' (miles per gallon) grouped by the
number of cylinders ('cyl')
avg_mpg_by_cyl <- aggregate(mpg ~ cyl,
University of Tabuk – Faculty of Science – Dept. of Stat. Statistical Computing– STAT 1204

data = mtcars,
FUN = mean)
avg_mpg_by_cyl
## cyl mpg
## 1 4 26.66364
## 2 6 19.74286
## 3 8 15.10000
If we break done the code;
i. mpg ~ cyl tells R to calculate the average mpg(dependent variable) for each
unique value of cyl(grouping factor).
ii. data = mtcars specifies the data set.
iii. FUN = mean applies the mean function to compute the average mpg for
each group of cyl.
We have just calculated the average mpg (miles per gallon) grouped by the number
of cyl(cylinders). Let’s make it a little bit more complex by grouping with multiple
variables and summarize multiple columns as well. We will calculate the mean
horsepower(hp) and the weight(wt) by the number of cylinders(cyl) and the
number of transmission(am).
Example: Use aggregate to find the mean hp and wt by cylinders and transmission
type
avg_hp_wt_by_cyl_am <- aggregate(cbind(hp, wt) ~ cyl + am,
data = mtcars,
FUN = mean)

avg_hp_wt_by_cyl_am
## cyl am hp wt
## 1 4 0 84.66667 2.935000
## 2 6 0 115.25000 3.388750
## 3 8 0 194.16667 4.104083
## 4 4 1 81.87500 2.042250
## 5 6 1 131.66667 2.755000
## 6 8 1 299.50000 3.370000
If we breakdown the code;
i. cbind(hp, wt) allows you to summarize multiple columns (hp and wt).
ii. cyl + am groups the data by the number of cylinders and the transmission
type (am = 0 for automatic, 1 for manual`).
iii. The argument FUN defines the function to be used here therefore, FUN =
mean calculates the mean values for hp and wt for each group of cyl and am.
Practical Exercise
using the aggregate() with the iris data set to find the mean sepal length
(Sepal.Length) and petal length(Petal.Length) for each species.
University of Tabuk – Faculty of Science – Dept. of Stat. Statistical Computing– STAT 1204

Solution
library(plyr)

# Load the iris data set


data(iris)

# Calculate the averages as per the instructions


avg_sepal_petal_by_species <- aggregate(cbind(Sepal.Length, Petal.Length) ~
Species,
data = iris,
FUN = mean)

avg_sepal_petal_by_species
## Species Sepal.Length Petal.Length
## 1 setosa 5.006 1.462
## 2 versicolor 5.936 4.260
## 3 virginica 6.588 5.552
__________________________________________________________________
3.3 Data Reshaping
Data reshaping is the process of transforming the layout or structure of a data set
without changing the actual data. You typically reshape data to suit different
analyses, visualizations, or reporting formats. Common operations for reshaping
include pivoting data between wide and long formats.
 Wide format: Each subject(row) has its own columns for measurements at
different time points or categories.
 Long format: The data has one measurement per row, making it easier to
analyze in some cases, especially with repeated measures.
In R, the most common function for reshaping data include;
 pivot_longer() and pivot_wider() from the tidyr package.
 melt() and dcast() from the reshape2 package.
Try it!
Let’s have some fun by working on the mtcars data set where we will demonstrate
reshaping between wide and long formats
Step 1: Inspect the Data
The mtcars data set is already in a wide format where each row represents a car,
and columns represent different variables for instance mpg, cyl, hp.
data(mtcars) # Load the data set

# First few records of the data set


head(mtcars)
University of Tabuk – Faculty of Science – Dept. of Stat. Statistical Computing– STAT 1204

## mpg cyl disp hp drat wt qsec vs am gear carb


## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
Step2: Converting from Wide to Long Format
We will use the pivot_longer() function from the tidyr package to convert the data
set from wide to long format. In this case, we will shape
the mpg, hp and wt columns into a longer format making it easier to work with.
library(tidyr)

# Reshape the data from wide to long format


mtcars_long <- mtcars %>%
pivot_longer(cols=c(mpg, hp, wt),
names_to = "variable",
values_to = "value")

# View the respaed data


head(mtcars_long)
## # A tibble: 6 × 10
## cyl disp drat qsec vs am gear carb variable value
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
## 1 6 160 3.9 16.5 0 1 4 4 mpg 21
## 2 6 160 3.9 16.5 0 1 4 4 hp 110
## 3 6 160 3.9 16.5 0 1 4 4 wt 2.62
## 4 6 160 3.9 17.0 0 1 4 4 mpg 21
## 5 6 160 3.9 17.0 0 1 4 4 hp 110
## 6 6 160 3.9 17.0 0 1 4 4 wt 2.88
If we break down the code;
i. pivot_longer() function moves the selected columns (mpg, hp, wt) into a
new “long” format, with eah row representing a unique combination of car
characteristics(variable) and their corresponding value.
ii. names_to = "variable": The variable names (e.g., mpg, hp, wt) are moved to
a column named “variable”.
iii. values_to = "value": The data for each variable is placed in a column
named "value".
Also, data in long format can be converted to a wide format.
The pivot_wider function from dplyr gets the work done.
Try it!
University of Tabuk – Faculty of Science – Dept. of Stat. Statistical Computing– STAT 1204

Lets put the pivot_wider function into practice. We will convert


the ntcars_long data set that we just recently generated to a wider format.
# Reshape from long to wide format
mtcars_wide <- mtcars_long %>%
pivot_wider(names_from = "variable", values_from = "value")

# View the reshaped data


head(mtcars_wide)
## # A tibble: 6 × 11
## cyl disp drat qsec vs am gear carb mpg hp wt
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 6 160 3.9 16.5 0 1 4 4 21 110 2.62
## 2 6 160 3.9 17.0 0 1 4 4 21 110 2.88
## 3 4 108 3.85 18.6 1 1 4 1 22.8 93 2.32
## 4 6 258 3.08 19.4 1 0 3 1 21.4 110 3.22
## 5 8 360 3.15 17.0 0 0 3 2 18.7 175 3.44
## 6 6 225 2.76 20.2 1 0 3 1 18.1 105 3.46
If we break down the code;
i. pivot_wider() converts the long format back into the wide format, with
separate columns for each variable (mpg, hp, wt).
ii. names_from = "variable": Moves the unique values from the "variable”
column into their own columns (e.g., mpg, hp, wt).
iii. values_from = "value": Populates the new columns with values from the
“value” column.
Practical Exercise
Use the pivot_longer() function to convert the iris dataset (which contains
measurements for different flower features) into a long format. Focus on
converting the numeric columns like Sepal.Length and Sepal.Width.
Then, use pivot_wider() to convert it back to a wide format.

Solution
Convert to long format
library(tidyr)

# Load the data


data(iris)

# Load the iris dataset and reshape it


iris_long <- iris %>% pivot_longer(cols = starts_with("Sepal"),
names_to = "feature", values_to = "measurement")
University of Tabuk – Faculty of Science – Dept. of Stat. Statistical Computing– STAT 1204

# View the reshaped data


head(iris_long)
## # A tibble: 6 × 5
## Petal.Length Petal.Width Species feature measurement
## <dbl> <dbl> <fct> <chr> <dbl>
## 1 1.4 0.2 setosa Sepal.Length 5.1
## 2 1.4 0.2 setosa Sepal.Width 3.5
## 3 1.4 0.2 setosa Sepal.Length 4.9
## 4 1.4 0.2 setosa Sepal.Width 3
## 5 1.3 0.2 setosa Sepal.Length 4.7
## 6 1.3 0.2 setosa Sepal.Width 3.2
Back to wide
# Now reshape it back to wide format
iris_wide <- iris_long %>%
pivot_wider(names_from = "feature", values_from = "measurement")

# View the reshaped data


head(iris_wide)
## # A tibble: 6 × 5
## Petal.Length Petal.Width Species Sepal.Length Sepal.Width
## <dbl> <dbl> <fct> <list> <list>
## 1 1.4 0.2 setosa <dbl [8]> <dbl [8]>
## 2 1.3 0.2 setosa <dbl [4]> <dbl [4]>
## 3 1.5 0.2 setosa <dbl [7]> <dbl [7]>
## 4 1.7 0.4 setosa <dbl [1]> <dbl [1]>
## 5 1.4 0.3 setosa <dbl [3]> <dbl [3]>
## 6 1.5 0.1 setosa <dbl [2]> <dbl [2]>

You might also like