How to filter R DataFrame by values in a column?

Last Updated : 30 May, 2021

In R Programming Language, dataframe columns can be subjected to constraints, and produce smaller subsets. However, while the conditions are applied, the following properties are maintained :

Rows are considered to be a subset of the input.
Rows in the subset appear in the same order as the original dataframe.
Columns remain unmodified.
The number of groups may be reduced, based on conditions.
dataframe attributes are preserved during data filter.

Method 1 : Using dataframe indexing

Any dataframe column in the R programming language can be referenced either through its name df$col-name or using its index position in the dataframe df[col-index]. The cell values of this column can then be subjected to constraints, logical or comparative conditions, and then a dataframe subset can be obtained. These conditions are applied to the row index of the dataframe so that the satisfied rows are returned.

Selection based on a check of missing values or NA

Cells in dataframe can contain missing values or NA as its elements, and they can be verified using is.na() method in R language.

Example:

# declaring a dataframe
data_frame = data.frame(col1 = c(NA,"b",NA,"e","e") , 
                        col2 = c(0,2,1,4,5), 
                        col3= c(TRUE,FALSE,FALSE,TRUE, TRUE))

print ("Original dataframe")
print (data_frame)

# checking which values are not NA
data_frame_mod <- data_frame[!is.na(data_frame$col1),]

print ("Modified dataframe")
print (data_frame_mod)

Output

[1] "Original dataframe"

col1 col2 col3

1 <NA> 0 TRUE

2 b 2 FALSE

3 <NA> 1 FALSE

4 e 4 TRUE

5 e 5 TRUE

[1] "Modified dataframe"

col1 col2 col3

2 b 2 FALSE

4 e 4 TRUE

5 e 5 TRUE

Selection based on a single comparative condition on a column

Column values can be subjected to constraints to filter and subset the data. The values can be mapped to specific occurrences or within a range.

Example:

# declaring a dataframe
data_frame = data.frame(col1 = c("b","b","e","e","e") , 
                        col2 = c(0,2,1,4,5), 
                        col3= c(TRUE,FALSE,FALSE,TRUE, TRUE))

print ("Original dataframe")
print (data_frame)

# checking which columns have col3
# value equivalent to true
data_frame_mod <- data_frame[data_frame$col3==TRUE,]

print ("Modified dataframe")
print (data_frame_mod)

Output

[1] "Original dataframe"

col1 col2 col3

1 b 0 TRUE

2 b 2 FALSE

3 e 1 FALSE

4 e 4 TRUE

5 e 5 TRUE

[1] "Modified dataframe"

col1 col2 col3

1 b 0 TRUE

4 e 4 TRUE

5 e 5 TRUE

Selection based on multiple comparative conditions on a column

Column values can be subjected to constraints to filter and subset the data. The conditions can be combined by logical & or | operators. The %in% operator is used here, in order to check values that match to any of the values within a specified vector.

Example:

# declaring a dataframe
data_frame = data.frame(col1 = c("b","b","d","e","e") , 
                        col2 = c(0,2,1,4,5), 
                        col3= c(TRUE,FALSE,FALSE,TRUE, TRUE))

print ("Original dataframe")
print (data_frame)

# checking which values of col1 
# are equivalent to b or e
data_frame_mod <- data_frame[data_frame$col1 %in% c("b","e"),]
print ("Modified dataframe")
print (data_frame_mod)

Output

[1] "Original dataframe"

col1 col2 col3

1 b 0 TRUE

2 b 2 FALSE

3 d 1 FALSE

4 e 4 TRUE

5 e 5 TRUE

[1] "Modified dataframe"

col1 col2 col3

1 b 0 TRUE

2 b 2 FALSE

4 e 4 TRUE

5 e 5 TRUE

Method 2 : Using dplyr library

The dplyr library can be installed and loaded into the working space which is used to perform data manipulation.

The filter() function is used to produce a subset of the dataframe, retaining all rows that satisfy the specified conditions. The filter() method in R can be applied to both grouped and ungrouped data. The expressions include comparison operators (==, >, >= ) , logical operators (&, |, !, xor()) , range operators (between(), near()) as well as NA value check against the column values. The subset dataframe has to be retained in a separate variable.

Syntax:

filter(df , cond)

Parameter :

df - The dataframe object

cond - The condition to filter the data upon

Example:

library ("dplyr")

# declaring a dataframe
data_frame = data.frame(col1 = c("b","b","d","e","e") , 
                        col2 = c(0,2,1,4,5), 
                        col3= c(TRUE,FALSE,FALSE,TRUE, TRUE))

print ("Original dataframe")
print (data_frame)

# checking which values of col1 
# are equivalent to b or e
data_frame_mod <- filter(data_frame,col2>1)

print ("Modified dataframe")
print (data_frame_mod)

Output

[1] "Original dataframe"

col1 col2 col3

1 b 0 TRUE

2 b 2 FALSE

3 d 1 FALSE

4 e 4 TRUE

5 e 5 TRUE

[1] "Modified dataframe"

col1 col2 col3

1 b 2 FALSE

2 e 4 TRUE

3 e 5 TRUE

Also, the values can be checked using the %in% operator to match the column cell values with the elements contained in the input specified vector.