Drop One or Multiple Columns From PySpark DataFrame
Last Updated :
17 Jun, 2021
In this article, we will discuss how to drop columns from a PySpark DataFrame.
In PySpark, the DataFrame drop() function removes one or more columns and returns a new DataFrame. It should not be confused with na.drop() (also available as dropna()), which removes rows that contain NULL values rather than columns. For reference, the row-level function looks like this:
Syntax: dataframe_name.na.drop(how="any/all", thresh=threshold_value, subset=["column_name_1", "column_name_2"])
- how – Takes either 'any' or 'all'. With 'any', a row is dropped if it contains a NULL in any column; with 'all', a row is dropped only if every column is NULL. The default is 'any'.
- thresh – Takes an integer and drops rows that have fewer than that many non-null values. The default is None. When thresh is given, it overrides how.
- subset – A list of column names to check for NULL values. The default is None, which means all columns are checked.
Python code to create an employee DataFrame with three columns:
Python3
# importing module
import pyspark
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
# list of employee data with 7 row values
data = [["1", "sravan", "company 1"],
        ["3", "bobby", "company 3"],
        ["2", "ojaswi", "company 2"],
        ["1", "sravan", "company 1"],
        ["3", "bobby", "company 3"],
        ["4", "rohith", "company 2"],
        ["5", "gnanesh", "company 1"]]
# specify column names
columns = ['Employee ID','Employee NAME','Company Name']
# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data,columns)
dataframe.show()
Output:
+-----------+-------------+------------+
|Employee ID|Employee NAME|Company Name|
+-----------+-------------+------------+
|          1|       sravan|   company 1|
|          3|        bobby|   company 3|
|          2|       ojaswi|   company 2|
|          1|       sravan|   company 1|
|          3|        bobby|   company 3|
|          4|       rohith|   company 2|
|          5|      gnanesh|   company 1|
+-----------+-------------+------------+
Example 1: Delete a single column.
Here we are going to delete a single column from the dataframe.
Syntax: dataframe.drop('column name')
Code:
Python3
# delete single column
dataframe = dataframe.drop('Employee ID')
dataframe.show()
Output:
+-------------+------------+
|Employee NAME|Company Name|
+-------------+------------+
|       sravan|   company 1|
|        bobby|   company 3|
|       ojaswi|   company 2|
|       sravan|   company 1|
|        bobby|   company 3|
|       rohith|   company 2|
|      gnanesh|   company 1|
+-------------+------------+
Example 2: Delete multiple columns.
Here we will delete multiple columns from the dataframe.
Syntax: dataframe.drop(*('column 1','column 2','column n'))
Code:
Python3
# delete two columns
dataframe = dataframe.drop(*('Employee NAME',
'Employee ID'))
dataframe.show()
Output:
+------------+
|Company Name|
+------------+
|   company 1|
|   company 3|
|   company 2|
|   company 1|
|   company 3|
|   company 2|
|   company 1|
+------------+
Example 3: Delete all columns
Here we will delete all the columns from the dataframe. To do this, we put the column names in a list and unpack it into drop().
Python3
# use a descriptive name rather than shadowing the built-in list
columns_to_drop = ['Employee ID','Employee NAME','Company Name']
# delete all columns
dataframe = dataframe.drop(*columns_to_drop)
dataframe.show()
Output:
++
||
++
||
||
||
||
||
||
||
++