Count non-NA values by group in DataFrame in R
Last Updated :
30 Jun, 2021
In this article, we will discuss how to count non-NA values by the group in dataframe in R Programming Language.
Method 1 : Using group_by() and summarise() methods
The dplyr package is used to perform simulations in the data by performing manipulations and transformations. The group_by() method in R programming language is used to group the specified dataframe in R. It can be used to categorize data depending on various aggregate functions like count, minimum, maximum, or sum.
Syntax:
group_by(col-name)
On application of group_by() method, the summarize method is applied to compute a tally of the total values obtained according to each group. The summation of the non-null values is calculated using the designated column name and the aggregate method sum() supplied with the is.na() method as its argument.
Syntax:
summarise ( new-col-name = sum(is.na (col-name))
Both the methods are applied in order to the input dataframe using the pipe operator. The output is returned in the form of a tibble, with the first column consisting of the input arguments of the group_by method and the second column being assigned the new column name specified and containing a summation of the values of each column.
Example:
R
# creating a dataframe
data_frame <- data.frame(col1 = sample(6:9, 9 , replace = TRUE),
col2 = letters[1:3],
col3 = c(1,4,NA,1,NA,NA,2,NA,2))
print ("Original DataFrame")
print (data_frame)
# grouping data by col1 and giving a total of
# non na values in col3
data_frame %>% group_by(col1) %>% summarise(
non_na = sum(!is.na(col3)))
Output
[1] "Original DataFrame"
col1 col2 col3
1 6 a 1
2 8 b 4
3 6 c NA
4 8 a 1
5 8 b NA
6 9 c NA
7 8 a 2
8 7 b NA
9 6 c 2
# A tibble: 4 x 2
col1 non_na
<int> <int>
1 6 2
2 7 0
3 8 3
4 9 0
Method 2: Using data.table
The library data.table in R is used to make statistical computations and deliberations based on the organization of data into well-defined tabular structures. The setDT method in R is used to convert lists (both named and unnamed) and dataframes to datatables by reference. The similar sum() and is.na() methods are applied over the columns of the dataframe in sequence to obtain the final output. The output returned is in the form of a data.table with row numbers followed by row identifiers followed by colon.
Syntax:
setDT(df)[, .(new-col-name = sum(!is.na(new-col-name))), col-name]
Example:
R
# importing required libraries
library(data.table)
# creating a dataframe
data_frame <- data.frame(col1 = sample(6:9, 9 , replace = TRUE),
col2 = letters[1:3],
col3 = c(1,4,NA,1,NA,NA,2,NA,2))
print ("Original DataFrame")
print (data_frame)
# grouping data by col1 and giving a total
# of non na values in col3
mod_df <- setDT(data_frame)[, .(non_na = sum(!is.na(col3))), col1]
print ("Modified DataFrame")
print (mod_df)
Output
[1] "Original DataFrame"
col1 col2 col3
1 7 a 1
2 6 b 4
3 6 c NA
4 7 a 1
5 9 b NA
6 8 c NA
7 6 a 2
8 8 b NA
9 8 c 2
[1] "Modified DataFrame"
col1 non_na
1: 7 2
2: 6 2
3: 9 0
4: 8 1
Method 3: Using aggregate method
The aggregate method in R is used to create the subsets produced from the result of dataframe splitting and then computes the summary statistics for each of the returned group.
Syntax:
aggregate (x , data , FUN)
Parameter :
x - the R storage object.
data - the dataframe or list to apply the aggregate method to.
FUN - the function to apply to each of the groups of the dataframe.
The cbind() method in R programming language is used to produce a concatenation of the columns produced as the output. The FUN applied is the sum operation to compute the sum of the non-null values segregated based on groups. The data is the input dataframe over which the FUN is applied.
Example:
R
# importing required libraries
library(data.table)
# creating a dataframe
data_frame <- data.frame(col1 = sample(6:9, 9 , replace = TRUE),
col2 = letters[1:3],
col3 = c(1,4,NA,1,NA,NA,2,NA,2))
print ("Original DataFrame")
print (data_frame)
# grouping data by col1 and giving a total
# of non na values in col3
mod_df <- aggregate(cbind(
non_na = !is.na(col3))~col1, data_frame, sum)
print ("Modified DataFrame")
print (mod_df)
Output
[1] "Original DataFrame"
col1 col2 col3
1 7 a 1
2 6 b 4
3 6 c NA
4 7 a 1
5 9 b NA
6 8 c NA
7 6 a 2
8 8 b NA
9 8 c 2
[1] "Modified DataFrame"
col1 non_na
1 7 2
2 6 2
3 9 0
4 8 1
Method 4 : Using table() method
The library data.table in R is used to make statistical computations and deliberations based on the organization of data into well-defined tabular structures. The table() method is used to generate a contingency table of the counts after computing the combination of each of the factor levels. Therefore, it is used to perform categorical tabulation of the data. Initially, the required column to check for NA values is specified under the constraint using the is.na() function. The non-null values are then extracted and a tally of them is produced using the data.table indexing methods.
Syntax:
is.na (df$col-name))
Example:
R
# importing required libraries
library(data.table)
# creating a dataframe
data_frame <- data.frame(col1 = sample(6:9, 9 , replace = TRUE),
col2 = letters[1:3],
col3 = c(1,4,NA,1,NA,NA,2,NA,2))
print ("Original DataFrame")
print (data_frame)
# grouping data by col1 and giving a
# total of non na values in col3
mod_df <- table(data_frame$col1[!is.na(data_frame$col3)])
print ("Modified DataFrame")
print (mod_df)
Output
[1] "Original DataFrame"
col1 col2 col3
1 7 a 1
2 9 b 4
3 8 c NA
4 6 a 1
5 6 b NA
6 8 c NA
7 9 a 2
8 9 b NA
9 8 c 2
[1] "Modified DataFrame"
6 7 8 9
1 1 1 2
Similar Reads
Count the number of NA values in a DataFrame column in R A null value in R is specified using either NaN or NA. In this article, we will see how can we count these values in a column of a dataframe. Approach Create dataframePass the column to be checked to is.na() function Syntax: is.na(column) Parameter: column: column to be searched for na values Return
1 min read
Create Lagged Variable by Group in R DataFrame Lagged variable is the type of variable that contains the previous value of the variable for which we want to create the lagged variable and the first value is neglected. Data can be segregated based on different groups in R programming language and then these categories can be processed differently
5 min read
How to filter R DataFrame by values in a column? In R Programming Language, dataframe columns can be subjected to constraints, and produce smaller subsets. However, while the conditions are applied, the following properties are maintained : Rows are considered to be a subset of the input.Rows in the subset appear in the same order as the original
5 min read
Count non zero values in each column of R dataframe In this article, we are going to count the number of non-zero data entries in the data using R Programming Language. To check the number of non-zero data entries in the data first we have to put that data in the data frame by using: data <- data.frame(x1 = c(1,2,0,100,0,3,10), x2 = c(5,0,1,8,10,0
2 min read
Count Unique Values by Group in R In the article, we are going to discuss how to count the number of unique values by the group in R Programming Language. So let's take the following example, Suppose you have a dataset with multiple columns like this: Â class age age_group 1 A 20 YOUNG 2 B 15 KID 3 C 45 OLD 4 B 14 KID 5 A 21 YOUNG 6
2 min read
Frequency count of multiple variables in R Dataframe A data frame may contain repeated or missing values. Each column may contain any number of duplicate or repeated instances of the same variable. Data statistics and analysis mostly rely on the task of computing the frequency or count of the number of instances a particular variable contains within e
4 min read