How to filter R DataFrame by values in a column?
Last Updated :
30 May, 2021
In R Programming Language, dataframe columns can be subjected to constraints, and produce smaller subsets. However, while the conditions are applied, the following properties are maintained :
- Rows are considered to be a subset of the input.
- Rows in the subset appear in the same order as the original dataframe.
- Columns remain unmodified.
- The number of groups may be reduced, based on conditions.
- dataframe attributes are preserved during data filter.
Method 1 : Using dataframe indexing
Any dataframe column in the R programming language can be referenced either through its name df$col-name or using its index position in the dataframe df[col-index]. The cell values of this column can then be subjected to constraints, logical or comparative conditions, and then a dataframe subset can be obtained. These conditions are applied to the row index of the dataframe so that the satisfied rows are returned.
- Selection based on a check of missing values or NA
Cells in dataframe can contain missing values or NA as its elements, and they can be verified using is.na() method in R language.
Example:
R
# declaring a dataframe
data_frame = data.frame(col1 = c(NA,"b",NA,"e","e") ,
col2 = c(0,2,1,4,5),
col3= c(TRUE,FALSE,FALSE,TRUE, TRUE))
print ("Original dataframe")
print (data_frame)
# checking which values are not NA
data_frame_mod <- data_frame[!is.na(data_frame$col1),]
print ("Modified dataframe")
print (data_frame_mod)
Output
[1] "Original dataframe"
col1 col2 col3
1 <NA> 0 TRUE
2 b 2 FALSE
3 <NA> 1 FALSE
4 e 4 TRUE
5 e 5 TRUE
[1] "Modified dataframe"
col1 col2 col3
2 b 2 FALSE
4 e 4 TRUE
5 e 5 TRUE
- Selection based on a single comparative condition on a column
Column values can be subjected to constraints to filter and subset the data. The values can be mapped to specific occurrences or within a range.
Example:
R
# declaring a dataframe
data_frame = data.frame(col1 = c("b","b","e","e","e") ,
col2 = c(0,2,1,4,5),
col3= c(TRUE,FALSE,FALSE,TRUE, TRUE))
print ("Original dataframe")
print (data_frame)
# checking which columns have col3
# value equivalent to true
data_frame_mod <- data_frame[data_frame$col3==TRUE,]
print ("Modified dataframe")
print (data_frame_mod)
Output
[1] "Original dataframe"
col1 col2 col3
1 b 0 TRUE
2 b 2 FALSE
3 e 1 FALSE
4 e 4 TRUE
5 e 5 TRUE
[1] "Modified dataframe"
col1 col2 col3
1 b 0 TRUE
4 e 4 TRUE
5 e 5 TRUE
- Selection based on multiple comparative conditions on a column
Column values can be subjected to constraints to filter and subset the data. The conditions can be combined by logical & or | operators. The %in% operator is used here, in order to check values that match to any of the values within a specified vector.
Example:
R
# declaring a dataframe
data_frame = data.frame(col1 = c("b","b","d","e","e") ,
col2 = c(0,2,1,4,5),
col3= c(TRUE,FALSE,FALSE,TRUE, TRUE))
print ("Original dataframe")
print (data_frame)
# checking which values of col1
# are equivalent to b or e
data_frame_mod <- data_frame[data_frame$col1 %in% c("b","e"),]
print ("Modified dataframe")
print (data_frame_mod)
Output
[1] "Original dataframe"
col1 col2 col3
1 b 0 TRUE
2 b 2 FALSE
3 d 1 FALSE
4 e 4 TRUE
5 e 5 TRUE
[1] "Modified dataframe"
col1 col2 col3
1 b 0 TRUE
2 b 2 FALSE
4 e 4 TRUE
5 e 5 TRUE
Method 2 : Using dplyr library
The dplyr library can be installed and loaded into the working space which is used to perform data manipulation.
The filter() function is used to produce a subset of the dataframe, retaining all rows that satisfy the specified conditions. The filter() method in R can be applied to both grouped and ungrouped data. The expressions include comparison operators (==, >, >= ) , logical operators (&, |, !, xor()) , range operators (between(), near()) as well as NA value check against the column values. The subset dataframe has to be retained in a separate variable.
Syntax:
filter(df , cond)
Parameter :
df - The dataframe object
cond - The condition to filter the data upon
Example:
R
library ("dplyr")
# declaring a dataframe
data_frame = data.frame(col1 = c("b","b","d","e","e") ,
col2 = c(0,2,1,4,5),
col3= c(TRUE,FALSE,FALSE,TRUE, TRUE))
print ("Original dataframe")
print (data_frame)
# checking which values of col1
# are equivalent to b or e
data_frame_mod <- filter(data_frame,col2>1)
print ("Modified dataframe")
print (data_frame_mod)
Output
[1] "Original dataframe"
col1 col2 col3
1 b 0 TRUE
2 b 2 FALSE
3 d 1 FALSE
4 e 4 TRUE
5 e 5 TRUE
[1] "Modified dataframe"
col1 col2 col3
1 b 2 FALSE
2 e 4 TRUE
3 e 5 TRUE
Also, the values can be checked using the %in% operator to match the column cell values with the elements contained in the input specified vector.
Example:
R
library ("dplyr")
# declaring a dataframe
data_frame = data.frame(col1 = c("b","b","d","e","e") ,
col2 = c(0,2,1,4,5),
col3= c(TRUE,FALSE,FALSE,TRUE, TRUE))
print ("Original dataframe")
print (data_frame)
# checking which values of col1
# are equivalent to b or e
data_frame_mod <- filter(data_frame,col1 %in% c("b","e"))
print ("Modified dataframe")
print (data_frame_mod)
Output
[1] "Original dataframe"
col1 col2 col3
1 b 0 TRUE
2 b 2 FALSE
3 d 1 FALSE
4 e 4 TRUE
5 e 5 TRUE
[1] "Modified dataframe"
col1 col2 col3
1 b 0 TRUE
2 b 2 FALSE
4 e 4 TRUE
5 e 5 TRUE