Chapter 3 _STAT1204..
2024-12-09
## $Petal.Length
## [1] 3.758
## $Petal.Width
## [1] 1.199333
You see! The lapply() function works column-wise rather than row-wise when it is
applied to a data frame.
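The column-wise behaviour follows from the fact that a data frame is stored as a list of columns. The sketch below illustrates this with the built-in iris data set; the object names col_means and row_means are chosen here for illustration only.

```r
# A data frame is a list of columns, so lapply() visits one column at a time
col_means <- lapply(iris[, 1:4], mean)
names(col_means)    # the four numeric column names
length(col_means)   # one result per column

# For row-wise work, apply() with MARGIN = 1 is one option
row_means <- apply(iris[, 1:4], 1, mean)
length(row_means)   # one result per row
```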
Example: Create a function that adds 10 to the input value and use the
lapply() function to apply it to a vector.
# Create a vector
current_ages <- c(21, 43, 12, 56, 32)
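The definition of add_10 is not shown in this part of the notes. A minimal sketch, assuming it simply returns its argument plus 10 (the function body and the name ages_list are assumptions):

```r
# Assumed definition: add_10 returns its input plus 10
add_10 <- function(x) {
  x + 10
}

current_ages <- c(21, 43, 12, 56, 32)

# lapply() calls add_10 once per element and returns a list
ages_list <- lapply(current_ages, add_10)
str(ages_list)
```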
Remember that we created a function add_10 that adds 10 to the current ages of the
clients. Let's repeat the same task using the sapply() function instead of lapply().
# Apply add_10 to each age; sapply() simplifies the result to a vector
ages_10_years_later <- sapply(current_ages, add_10)
print(ages_10_years_later)
## [1] 31 53 22 66 42
The sapply() function returns a plain numeric vector here, which is simpler to read and
reuse than the list returned by lapply().
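The difference in return type can be checked directly. This is a small sketch that assumes add_10 adds 10 to its argument, as described in the text; the names as_list and as_vector are illustrative.

```r
add_10 <- function(x) x + 10   # assumed definition, as described in the text
current_ages <- c(21, 43, 12, 56, 32)

as_list <- lapply(current_ages, add_10)
as_vector <- sapply(current_ages, add_10)

class(as_list)     # a list
class(as_vector)   # a numeric vector

# unlist() flattens the lapply() result into the vector that sapply() returns
identical(unlist(as_list), as_vector)
```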
The tapply() function applies a function to subsets of data grouped by a
factor (e.g., species in our case). Let’s calculate the average sepal length for
each species:
# Calculate the average Sepal.Length for each Species
avg_sepal_by_species <- tapply(iris$Sepal.Length, iris$Species, mean)
print(avg_sepal_by_species)
## setosa versicolor virginica
## 5.006 5.936 6.588
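One way to picture what tapply() does is to split the values by group and then apply the function to each piece. The sketch below reproduces the result with split() and sapply(); the intermediate names by_species, avg_manual and avg_tapply are illustrative.

```r
# split() breaks Sepal.Length into one vector per species
by_species <- split(iris$Sepal.Length, iris$Species)

# sapply() then applies mean to each piece
avg_manual <- sapply(by_species, mean)

# tapply() does both steps in one call
avg_tapply <- tapply(iris$Sepal.Length, iris$Species, mean)

all.equal(as.vector(avg_manual), as.vector(avg_tapply))
```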
Finally, the mapply() function is useful when you want to apply a function
to multiple sets of arguments at once. Let’s calculate the sum
of Sepal.Length and Sepal.Width for each row:
# Sum Sepal.Length and Sepal.Width for each row
sepal_sum <- mapply(sum, iris$Sepal.Length, iris$Sepal.Width)
head(sepal_sum)
## [1] 8.6 7.9 7.9 7.7 8.6 9.3
The mapply() call adds the sepal length and width for each flower, row by row. It’s like
your helper asking every customer for two values and adding them together.
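mapply() passes the i-th element of each vector to the function as separate arguments. The sketch below uses a hypothetical two-argument helper, add_two, and compares the result with ordinary vectorized addition, which gives the same answer for this simple case.

```r
# Hypothetical helper used only for illustration
add_two <- function(len, wid) len + wid

# mapply() pairs up the elements: (len[1], wid[1]), (len[2], wid[2]), ...
sum_mapply <- mapply(add_two, iris$Sepal.Length, iris$Sepal.Width)

# For plain addition, vectorized arithmetic gives the same values
sum_vectorized <- iris$Sepal.Length + iris$Sepal.Width

all.equal(sum_mapply, sum_vectorized)
```

mapply() earns its keep when the function is not already vectorized over its arguments.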
Practical Exercise
1. Use apply() to calculate the maximum for each column in the iris data set.
Solution
max_values <- apply(iris[, 1:4], 2, max)
print(max_values)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 7.9 4.4 6.9 2.5
2. Use lapply() to find the summary statistics (use the summary() function) for
each numeric column in the iris data set.
Solution
sum_stats <- lapply(iris[,1:4], summary)
print(sum_stats)
## $Sepal.Length
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.300 5.100 5.800 5.843 6.400 7.900
## $Sepal.Width
University of Tabuk – Faculty of Science – Dept. of Stat. Statistical Computing– STAT 1204
# Use aggregate to find the average 'mpg' (miles per gallon) grouped by the
# number of cylinders ('cyl')
avg_mpg_by_cyl <- aggregate(mpg ~ cyl,
data = mtcars,
FUN = mean)
avg_mpg_by_cyl
## cyl mpg
## 1 4 26.66364
## 2 6 19.74286
## 3 8 15.10000
If we break down the code:
i. mpg ~ cyl tells R to calculate the average mpg (the dependent variable) for each
unique value of cyl (the grouping factor).
ii. data = mtcars specifies the data set.
iii. FUN = mean applies the mean function to compute the average mpg for
each group of cyl.
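The three arguments above can be cross-checked against tapply(), which computes the same group means but returns a named vector instead of a data frame. This is a sketch; the name avg_mpg_tapply is illustrative.

```r
avg_mpg_by_cyl <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)

# aggregate() returns a data frame with one row per group
class(avg_mpg_by_cyl)

# tapply() computes the same group means as a named vector
avg_mpg_tapply <- tapply(mtcars$mpg, mtcars$cyl, mean)

all.equal(avg_mpg_by_cyl$mpg, as.vector(avg_mpg_tapply))
```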
We have just calculated the average mpg (miles per gallon) grouped by the number
of cylinders (cyl). Let’s make it a little more complex by grouping by multiple
variables and summarizing multiple columns as well. We will calculate the mean
horsepower (hp) and weight (wt) by the number of cylinders (cyl) and the
transmission type (am).
Example: Use aggregate to find the mean hp and wt by cylinders and transmission
type
avg_hp_wt_by_cyl_am <- aggregate(cbind(hp, wt) ~ cyl + am,
data = mtcars,
FUN = mean)
avg_hp_wt_by_cyl_am
## cyl am hp wt
## 1 4 0 84.66667 2.935000
## 2 6 0 115.25000 3.388750
## 3 8 0 194.16667 4.104083
## 4 4 1 81.87500 2.042250
## 5 6 1 131.66667 2.755000
## 6 8 1 299.50000 3.370000
If we break down the code:
i. cbind(hp, wt) allows you to summarize multiple columns (hp and wt).
ii. cyl + am groups the data by the number of cylinders and the transmission
type (am = 0 for automatic, 1 for manual).
iii. The argument FUN defines the function to apply; here, FUN = mean
calculates the mean of hp and wt for each combination of cyl and am.
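To see what a single row of this result means, one group can be recomputed by hand. The sketch below picks the 4-cylinder automatic cars and compares their means with the matching row from aggregate(); the names four_cyl_auto, manual_hp, manual_wt and agg_row are illustrative.

```r
avg_hp_wt_by_cyl_am <- aggregate(cbind(hp, wt) ~ cyl + am,
                                 data = mtcars,
                                 FUN = mean)

# Pick out the 4-cylinder automatic cars by hand
four_cyl_auto <- mtcars[mtcars$cyl == 4 & mtcars$am == 0, ]
manual_hp <- mean(four_cyl_auto$hp)
manual_wt <- mean(four_cyl_auto$wt)

# The matching row of the aggregate() result
agg_row <- avg_hp_wt_by_cyl_am[avg_hp_wt_by_cyl_am$cyl == 4 &
                                 avg_hp_wt_by_cyl_am$am == 0, ]

all.equal(c(manual_hp, manual_wt), c(agg_row$hp, agg_row$wt))
```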
Practical Exercise
Use aggregate() with the iris data set to find the mean sepal length
(Sepal.Length) and petal length (Petal.Length) for each species.
Solution
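library(plyr) # not required here: aggregate() is part of base R
avg_sepal_petal_by_species <- aggregate(cbind(Sepal.Length, Petal.Length) ~ Species,
data = iris, FUN = mean)
avg_sepal_petal_by_species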
## Species Sepal.Length Petal.Length
## 1 setosa 5.006 1.462
## 2 versicolor 5.936 4.260
## 3 virginica 6.588 5.552
__________________________________________________________________
3.3 Data Reshaping
Data reshaping is the process of transforming the layout or structure of a data set
without changing the actual data. You typically reshape data to suit different
analyses, visualizations, or reporting formats. Common operations for reshaping
include pivoting data between wide and long formats.
Wide format: Each subject (row) has its own columns for measurements at
different time points or categories.
Long format: The data has one measurement per row, making it easier to
analyze in some cases, especially with repeated measures.
In R, the most common functions for reshaping data include:
pivot_longer() and pivot_wider() from the tidyr package.
melt() and dcast() from the reshape2 package.
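The difference between the two layouts is easiest to see on a tiny table. The sketch below uses a hypothetical data frame (the columns subject, t1 and t2 and their values are made up for illustration) and assumes the tidyr package is installed.

```r
library(tidyr)

# Hypothetical wide data: one row per subject, one column per time point
wide <- data.frame(
  subject = c("A", "B"),
  t1 = c(5.1, 4.8),
  t2 = c(5.6, 5.0)
)

# Wide -> long: one row per subject and time point
long <- pivot_longer(wide, cols = c(t1, t2),
                     names_to = "time", values_to = "value")

# Long -> wide: back to one row per subject
wide_again <- pivot_wider(long, names_from = time, values_from = value)

long
wide_again
```

The long table has four rows (two subjects times two time points), while both wide tables have two rows; the values themselves are unchanged.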
Try it!
Let’s have some fun with the mtcars data set, which we will use to demonstrate
reshaping between wide and long formats.
Step 1: Inspect the Data
The mtcars data set is already in a wide format, where each row represents a car
and the columns represent different variables, for instance mpg, cyl, and hp.
data(mtcars) # Load the data set
Solution
Convert to long format
library(tidyr)