How to use Is Not Null in PySpark
Last Updated: 29 Jul, 2024
Null values are undefined or empty entries in a DataFrame. They may appear because of errors during data transfer or other technical glitches. We should identify null values and handle them appropriately in the DataFrame. In this article, we will learn how to use the isNotNull() method in PySpark to remove rows with NULL values from a DataFrame.
What is PySpark?
PySpark is the Python API for Apache Spark, an open-source distributed computing system designed for large-scale data processing. It enables data scientists to use Spark's capabilities from Python, allowing seamless data manipulation, analysis, and machine learning at scale.
The isNotNull() Method
The isNotNull() method is provided by Spark SQL and is defined on the Column class. It returns a boolean Column that evaluates to True for every row where the column value is not null and False where it is null. It is typically used with the filter() method of the DataFrame class, which takes a condition as an argument and keeps only the rows that satisfy it.
Using the isNotNull() Method in PySpark
To use the isNotNull() method, apply it to a DataFrame column and then use the filter() function to retain only the rows that meet the condition. The process is straightforward and integrates well with PySpark's DataFrame operations.
Example:
Here we will create a DataFrame with some null values using Python in PySpark. We use None, Python's built-in null object, to represent missing values. The DataFrame is created from a list of Row objects, each of which takes column names and their respective values as arguments. To visualize the output as a table, we use the show() method of the DataFrame object.
Python
from pyspark.sql import SparkSession
from pyspark.sql import Row

# create (or reuse) a SparkSession
spark = SparkSession.builder.appName("GeeksForGeeks").getOrCreate()

# sample rows; None marks the null values in the age and marks columns
data = [
    Row(name="Alpha", age=20, marks=54),
    Row(name="Beta", age=None, marks=None),
    Row(name="Omega", age=17, marks=85),
    Row(name="Sigma", age=None, marks=62)
]

df = spark.createDataFrame(data)
df.show()
Output:
DataFrame created
Example:
In this example, we filter out null values in the age column of the DataFrame by calling the filter() method and passing df["age"].isNotNull() as the condition, which checks whether each value in that column is non-null. Only the rows that do not contain a null value in the age column are displayed.
Python
new_df = df.filter(df["age"].isNotNull())
# or
# new_df = df.filter(df.age.isNotNull())
new_df.show()
Output:
Filtered DataFrame based on Age
Example:
In this example, we filter out the rows with null values in the marks column and display the remaining rows.
Python
new_df = df.filter(df["marks"].isNotNull())
# or
# new_df = df.filter(df.marks.isNotNull())
new_df.show()
Output:
Filtered DataFrame based on Marks
Example:
We can also check multiple columns at once by combining conditions with the & (AND) operator, keeping only the rows where two or more columns are non-null.
Python
new_df = df.filter(df["age"].isNotNull() & df["marks"].isNotNull())
# or
# new_df = df.filter(df.age.isNotNull() & df.marks.isNotNull())
new_df.show()
Output:
Filtered DataFrame based on Age and Marks
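As a side note, the same multi-column result can also be obtained without isNotNull() by using the DataFrame's dropna() method (also reachable as df.na.drop()), which removes rows containing nulls in the listed columns. A minimal sketch, reusing the df created above:
Python
# drop the rows that have a null in either the age or the marks column;
# equivalent to filtering with isNotNull() on both columns
new_df = df.dropna(subset=["age", "marks"])
new_df.show()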
Conclusion
In this article we have seen how to filter out null values from one or more columns using the isNotNull() method provided by the PySpark library. We have provided suitable examples that can easily be adapted to your own use cases.
Q. What is the isNull() method of Column object?
Column.isNull() returns a boolean Column that is True for the rows where the value is null. Used together with the filter() method, it selects all the rows of the DataFrame that contain null values in that column. It is the opposite of isNotNull(), which is used to find the rows that do not contain null values.
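For instance, a minimal sketch (reusing the df created earlier) that keeps only the rows with a null age:
Python
# keep only the rows where the age column is null
null_age_df = df.filter(df["age"].isNull())
null_age_df.show()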
Q. What is the difference between show() and collect()?
Both methods let you inspect a DataFrame's contents, but in different ways. DataFrame.show() prints the DataFrame as a table in the terminal and takes an optional integer n to display only the first n rows. It also has other optional arguments, such as truncate, which can be set to an integer to limit the length of the data shown in each cell, and vertical, a boolean used to print the rows vertically. DataFrame.collect(), on the other hand, returns the DataFrame's rows to the driver as a list of Row objects; its output is not visually represented as a table.
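A short sketch of both methods, reusing the df from above (the argument values here are only illustrative):
Python
# print the first 2 rows as a table, truncating each cell to 10 characters
df.show(2, truncate=10)

# bring all rows back to the driver as a Python list of Row objects
rows = df.collect()
print(rows[0]["name"])  # access a field of the first Row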
Q. Is it mandatory to use Row object to create DataFrame?
It is not mandatory to use Row objects to create a DataFrame. We can also create a Spark DataFrame from a list of tuples together with a list of column names (the header). Both lists are passed to the createDataFrame(data, header) method, and the result is the same as when using Row objects.
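For example, a minimal sketch that builds the same DataFrame from a list of tuples and a separate list of column names:
Python
# the same data expressed as plain tuples, plus a header list
data = [
    ("Alpha", 20, 54),
    ("Beta", None, None),
    ("Omega", 17, 85),
    ("Sigma", None, 62)
]
header = ["name", "age", "marks"]

df2 = spark.createDataFrame(data, header)
df2.show()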