How to use Is Not Null in PySpark
Last Updated: 29 Jul, 2024
Null values are undefined or empty entries in a DataFrame. They may appear because of errors during data transfer or other technical glitches. We should identify null values and handle them appropriately in the DataFrame. In this article, we will learn how to use the isNotNull() method in PySpark to remove rows with NULL values from a DataFrame.
What is PySpark?
PySpark is the Python API for Apache Spark, an open-source distributed computing system designed for large-scale data processing. It enables data scientists to use Spark's capabilities from Python, allowing seamless data manipulation, analysis, and machine learning at scale.
The isNotNull() Method
The isNotNull() method is provided by Spark SQL and is defined on the Column class. It returns a boolean Column that evaluates to True for every row where the column value is not null and False where it is null. It is typically used with the filter() method of the DataFrame class, which takes a condition as an argument and keeps only the rows that satisfy it.
Using the isNotNull() Method in PySpark
To use the isNotNull() method, apply it to a DataFrame column and then use the filter() function to retain only the rows that meet the condition. The process is straightforward and integrates well with PySpark's DataFrame operations.
Example:
Here we will create a DataFrame with some null values using Python in PySpark. We use None, Python's built-in null object, to represent missing values. The DataFrame is created from a list of Row objects, each of which takes column names and their respective values as arguments. To visualize the output as a table, we use the show() method of the DataFrame object.
Python
from pyspark.sql import SparkSession
from pyspark.sql import Row

# create (or reuse) a SparkSession
spark = SparkSession.builder.appName("GeeksForGeeks").getOrCreate()

# sample rows; None marks the null values in the age and marks columns
data = [
    Row(name="Alpha", age=20, marks=54),
    Row(name="Beta", age=None, marks=None),
    Row(name="Omega", age=17, marks=85),
    Row(name="Sigma", age=None, marks=62)
]

df = spark.createDataFrame(data)
df.show()
Output:
DataFrame created
Example:
In this example, we filter out null values in the age column of the DataFrame by calling the filter() method and passing df["age"].isNotNull() as the condition, which checks whether each value in that column is non-null. Only the rows that do not contain a null value in the age column are displayed.
Python
new_df = df.filter(df["age"].isNotNull())
# or
# new_df = df.filter(df.age.isNotNull())
new_df.show()
Output:
Filtered DataFrame based on Age
Example:
In this example, we filter out the rows with null values in the marks column and display the remaining rows.
Python
new_df = df.filter(df["marks"].isNotNull())
# or
# new_df = df.filter(df.marks.isNotNull())
new_df.show()
Output:
Filtered DataFrame based on Marks
Example:
We can also check multiple columns at once by combining conditions with the & (AND) operator, keeping only the rows where two or more columns are non-null.
Python
new_df = df.filter(df["age"].isNotNull() & df["marks"].isNotNull())
# or
# new_df = df.filter(df.age.isNotNull() & df.marks.isNotNull())
new_df.show()
Output:
Filtered DataFrame based on Age and Marks
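As a side note, the same multi-column result can also be obtained without isNotNull() by using the DataFrame's dropna() method (also reachable as df.na.drop()), which removes rows containing nulls in the listed columns. A minimal sketch, reusing the df created above:
Python
# drop the rows that have a null in either the age or the marks column;
# equivalent to filtering with isNotNull() on both columns
new_df = df.dropna(subset=["age", "marks"])
new_df.show()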
Conclusion
In this article we have seen how to filter out null values from one or more columns using the isNotNull() method provided by the PySpark library. We have provided suitable examples that can easily be adapted to your own use cases.
Q. What is the isNull() method of Column object?
Column.isNull() returns a boolean Column that is True for the rows where the value is null. Used together with the filter() method, it selects all the rows of the DataFrame that contain null values in that column. It is the opposite of isNotNull(), which is used to find the rows that do not contain null values.
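For instance, a minimal sketch (reusing the df created earlier) that keeps only the rows with a null age:
Python
# keep only the rows where the age column is null
null_age_df = df.filter(df["age"].isNull())
null_age_df.show()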
Q. What is the difference between show() and collect()?
Both methods let you inspect a DataFrame's contents, but in different ways. DataFrame.show() prints the DataFrame as a table in the terminal and takes an optional integer n to display only the first n rows. It also has other optional arguments, such as truncate, which can be set to an integer to limit the length of the data shown in each cell, and vertical, a boolean used to print the rows vertically. DataFrame.collect(), on the other hand, returns the DataFrame's rows to the driver as a list of Row objects; its output is not visually represented as a table.
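A short sketch of both methods, reusing the df from above (the argument values here are only illustrative):
Python
# print the first 2 rows as a table, truncating each cell to 10 characters
df.show(2, truncate=10)

# bring all rows back to the driver as a Python list of Row objects
rows = df.collect()
print(rows[0]["name"])  # access a field of the first Row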
Q. Is it mandatory to use Row object to create DataFrame?
It is not mandatory to use Row objects to create a DataFrame. We can also create a Spark DataFrame from a list of tuples together with a list of column names (the header). Both lists are passed to the createDataFrame(data, header) method, and the result is the same as when using Row objects.
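For example, a minimal sketch that builds the same DataFrame from a list of tuples and a separate list of column names:
Python
# the same data expressed as plain tuples, plus a header list
data = [
    ("Alpha", 20, 54),
    ("Beta", None, None),
    ("Omega", 17, 85),
    ("Sigma", None, 62)
]
header = ["name", "age", "marks"]

df2 = spark.createDataFrame(data, header)
df2.show()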