How to find group-wise summary statistics for R dataframe?
Last Updated: 21 Apr, 2021
Group-wise summary statistics help us understand a data frame. A summary typically includes the mean, median, minimum, maximum, and quartiles of the data. It can be computed for a single column (variable) or for the entire data frame. In this article, we are going to see how to find group-wise summary statistics for a data frame in the R programming language.
In the code below we use a built-in dataset: the iris flower dataset. We can inspect the dataset with the head() or tail() function, which print the first and last rows of the data frame, respectively. Here we display the first 10 rows of our sample data frame.
R
# import data
df <- iris
# inspecting the dataset
head(df, 10)
Output:
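The paragraph above also mentions tail(); a minimal sketch of the other inspection calls, using only base R functions on the same built-in dataset, might look like this:

```r
# Built-in dataset, as in the example above
df <- iris

# Last 10 rows of the data frame
tail(df, 10)

# Number of rows and columns
dim(df)

# Column names and types at a glance
str(df)
```

Checking the dimensions and column types first shows which columns are numeric and which are factors, which matters for how summary() treats them.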

Summary of single variable or column
Our data frame is stored in the "df" variable. We want to print the summary of the column Sepal.Length, so we pass "df$Sepal.Length" as an argument to the summary() function. Note that column names are case-sensitive.
Syntax: summary(dataframe$column_name)
The summary() function takes a data frame column and, for a numeric column, returns:
- Central tendency -> mean and median,
- Quartiles -> 1st and 3rd quartiles (25th and 75th percentiles), which bound the interquartile range,
- Range -> min and max values for that single column.
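The six values listed above can also be computed one at a time with base R functions. The sketch below does this for Sepal.Length; the names `x` and `manual` are only illustrative and do not come from the original article.

```r
# Numeric column from the built-in iris dataset
x <- iris$Sepal.Length

# Compute each component reported by summary() by hand
manual <- c(
  Min    = min(x),
  Q1     = unname(quantile(x, 0.25)),  # 1st quartile (25th percentile)
  Median = median(x),
  Mean   = mean(x),
  Q3     = unname(quantile(x, 0.75)),  # 3rd quartile (75th percentile)
  Max    = max(x)
)
manual

# The interquartile range is the distance between the two quartiles
IQR(x)
```

Comparing `manual` with the result of summary(x) is a quick way to confirm what each field in the summary means.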
Example 1:
R
df <- iris
summary(df$Sepal.Length)
Output:
Example 2: We can also pass "digits" as an argument, which specifies the number of significant digits (not the number of decimal places) used when formatting the output values.
Syntax: summary(dataframe$column_name, digits = number_of_significant_digits)
R
df <- iris
summary(df$Sepal.Width, digits = 3)
Output:
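To make the difference between significant digits and decimal places concrete, the sketch below compares signif() and round() on the mean of Sepal.Width. It only illustrates the two kinds of rounding and is not meant as a description of how summary() formats its output internally; the variable name `m` is illustrative.

```r
# Mean of the column used in Example 2
m <- mean(iris$Sepal.Width)

# Three significant digits: digits are counted from the first non-zero digit
sig3 <- signif(m, digits = 3)

# Three decimal places: digits are counted after the decimal point
dec3 <- round(m, digits = 3)

sig3
dec3
```

Here the two results differ, which is why "digits = 3" should not be read as "three decimal places".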

Summary of entire dataframe
In the code below, we pass the entire data frame as an argument to the summary() function, so it computes a summary of every column (variable) in the data frame.
Syntax: summary(dataframe_name)
R
df <- iris
summary(df)
Output:
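When summary() is applied to a whole data frame, each column is summarised according to its type. A minimal sketch of that behaviour on the iris columns, using only base R:

```r
df <- iris

# Class of every column: four numeric columns and one factor
col_classes <- sapply(df, class)
col_classes

# Numeric column: min, quartiles, median, mean, max
summary(df$Petal.Length)

# Factor column: number of rows per level
species_counts <- summary(df$Species)
species_counts
```

For the Species factor the summary is a count per level rather than a set of quartiles, which is also what appears in the Species column of summary(df).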

Group-wise summary of data
For a better understanding of data frames in R, refer to the R - Data Frames article.
Let's create a sample data frame first:
R
df <- data.frame(
  Weekday = factor(rep(c("Mon", "Tues", "Wed",
                         "Thurs", "Fri"), each = 4),
                   levels = c("Mon", "Tues", "Wed",
                              "Thurs", "Fri")),
  Quarter = paste0("Q", rep(1:4, each = 5)),
  Delay = c(9.9, 5.4, 8.8, 6.9, 4.9, 9.7, 7.9, 5, 8.8,
            11.1, 10.2, 9.3, 12.2, 10.2, 9.2, 9.7, 12.2,
            8.1, 7.9, 5.6))
df
Output:

Summarising group-wise data of Single Variable
Our data frame consists of 3 variables: Weekday, Quarter, and Delay. The variable we will summarise is Delay, grouped by Weekday; in the process, the Quarter variable is collapsed.
In the code below, we use the dplyr package. dplyr is a grammar of data manipulation that provides a consistent set of verbs for the most common data manipulation tasks. We perform the grouping operation with the group_by() function and the summary operation with the summarize() function. We then calculate 2 summary statistics per group: the minimum delay time and the maximum delay time.
Syntax: group_by(variable_name)
R
library(dplyr)
df <- data.frame(
  Weekday = factor(rep(c("Mon", "Tues", "Wed", "Thurs",
                         "Fri"), each = 4),
                   levels = c("Mon", "Tues", "Wed", "Thurs",
                              "Fri")),
  Quarter = paste0("Q", rep(1:4, each = 5)),
  Delay = c(9.9, 5.4, 8.8, 6.9, 4.9, 9.7, 7.9, 5, 8.8,
            11.1, 10.2, 9.3, 12.2, 10.2, 9.2, 9.7, 12.2,
            8.1, 7.9, 5.6))
df %>%
  group_by(Weekday) %>%
  summarize(min_delay = min(Delay), max_delay = max(Delay))
Output:
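The same group_by() and summarize() pattern can report other statistics, such as the group size, mean, median, and standard deviation. The sketch below is one possible extension, not part of the original example; the output column names (n, mean_delay, median_delay, sd_delay) and the name `delay_stats` are illustrative, and the Quarter column is left out because it is not needed for this summary.

```r
library(dplyr)

# Same Weekday and Delay values as the sample data frame above
df <- data.frame(
  Weekday = factor(rep(c("Mon", "Tues", "Wed", "Thurs", "Fri"), each = 4),
                   levels = c("Mon", "Tues", "Wed", "Thurs", "Fri")),
  Delay = c(9.9, 5.4, 8.8, 6.9, 4.9, 9.7, 7.9, 5, 8.8,
            11.1, 10.2, 9.3, 12.2, 10.2, 9.2, 9.7, 12.2,
            8.1, 7.9, 5.6))

# One row per weekday with several summary statistics
delay_stats <- df %>%
  group_by(Weekday) %>%
  summarize(
    n            = n(),            # number of rows in the group
    mean_delay   = mean(Delay),
    median_delay = median(Delay),
    sd_delay     = sd(Delay)
  )
delay_stats
```

Because Weekday is a factor with explicit levels, the rows of the result follow the level order (Mon to Fri) rather than alphabetical order.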

Summarising group-wise data of Multiple Variable
Let's create another sample data frame, df2:
R
# sample dataframe
df2 <- data.frame(
  Quarter = paste0("Q", rep(1:4, each = 4)),
  Week = rep(c("Weekday", "Weekend"), each = 2, times = 4),
  Direction = rep(c("Inbound", "Outbound"), times = 8),
  Delay = c(10.8, 9.7, 15.5, 10.4, 11.8, 8.9, 5.5,
            3.3, 10.6, 8.8, 6.6, 5.2, 9.1, 7.3, 5.3, 4.4))
df2
Output:

Summarizing data group-wise:
In this case, our data frame has 4 variables: Quarter, Week, Direction, and Delay. In the code below, we group and summarise by Quarter and Week; in the process, the variable Direction is collapsed.
Syntax: group_by(variable_name1, variable_name2)
R
library(dplyr)
# sample dataframe
df2 <- data.frame(
  Quarter = paste0("Q", rep(1:4, each = 4)),
  Week = rep(c("Weekday", "Weekend"), each = 2, times = 4),
  Direction = rep(c("Inbound", "Outbound"), times = 8),
  Delay = c(10.8, 9.7, 15.5, 10.4, 11.8, 8.9, 5.5,
            3.3, 10.6, 8.8, 6.6, 5.2, 9.1, 7.3, 5.3, 4.4))
# summarizing by group
df2 %>%
  group_by(Quarter, Week) %>%
  summarize(min_delay = min(Delay), max_delay = max(Delay))
Output:
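If dplyr is not available, a similar two-variable grouping can be sketched in base R with aggregate(). This is an alternative technique rather than the method used above; the result names (min_delay, max_delay, delay_range) and the "_min"/"_max" suffixes are illustrative choices.

```r
# Same sample data frame as above
df2 <- data.frame(
  Quarter = paste0("Q", rep(1:4, each = 4)),
  Week = rep(c("Weekday", "Weekend"), each = 2, times = 4),
  Direction = rep(c("Inbound", "Outbound"), times = 8),
  Delay = c(10.8, 9.7, 15.5, 10.4, 11.8, 8.9, 5.5,
            3.3, 10.6, 8.8, 6.6, 5.2, 9.1, 7.3, 5.3, 4.4))

# Minimum and maximum delay for every Quarter/Week combination
min_delay <- aggregate(Delay ~ Quarter + Week, data = df2, FUN = min)
max_delay <- aggregate(Delay ~ Quarter + Week, data = df2, FUN = max)

# Combine both results into one table keyed by the grouping variables
delay_range <- merge(min_delay, max_delay,
                     by = c("Quarter", "Week"),
                     suffixes = c("_min", "_max"))
delay_range
```

As in the dplyr version, Direction does not appear in the formula, so its values are collapsed within each Quarter and Week group.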