How to get distinct rows in dataframe using PySpark?

Last Updated : 30 May, 2021

In this article, we are going to get the distinct rows from a PySpark dataframe in Python. We create the dataframe from a nested list by passing the list to the createDataFrame() method, then call the distinct() method, which returns a new dataframe containing only the distinct rows.

Syntax: dataframe.distinct()

Where dataframe is the dataframe created from the nested list using PySpark. Two rows are treated as duplicates only when they match in every column.

Example 1: Python code to get the distinct data from college data in a dataframe created from a list of lists.

Python3

# importing sparksession from
# pyspark.sql module
from pyspark.sql import SparkSession

# creating sparksession and giving
# an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of college data; the last row
# repeats the first one
data = [["1", "bobby", "vvit"],
        ["2", "sravan", "jntuk"],
        ["3", "rohith", "AU"],
        ["4", "sridevi", "GVRS"],
        ["1", "bobby", "vvit"]]

# specify column names
columns = ['ID', 'NAME', 'COLLEGE']

# creating a dataframe from the
# lists of data
dataframe = spark.createDataFrame(data, columns)

print('Actual data in dataframe')
dataframe.show()

Output: The dataframe has five rows, and the row (1, bobby, vvit) appears twice.

Now get the distinct rows in the dataframe:

Python3

print('distinct data')

# display distinct data
dataframe.distinct().show()

Output: Four rows remain, with the row (1, bobby, vvit) shown only once. The order of the rows in the result is not guaranteed.

Example 2: Python program to get the distinct rows from a dataframe with a single row

Python3

# importing sparksession from
# pyspark.sql module
from pyspark.sql import SparkSession

# creating sparksession and giving
# an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of college data
data = [["1", "bobby", "vvit"]]

# specify column names
columns = ['ID', 'NAME', 'COLLEGE']

# creating a dataframe from the
# list of data
dataframe = spark.createDataFrame(data, columns)

print('Actual data in dataframe')
dataframe.show()

Output: The dataframe has a single row, (1, bobby, vvit).

Now get the distinct rows in the dataframe:

Python3

print('distinct data')

# display distinct data from
# the dataframe
dataframe.distinct().show()

Output: The same single row is shown, because there are no duplicates to remove.

Author: sravankumar_171fa07058
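The row-level comparison behind distinct() can be illustrated without starting a Spark session. The sketch below is plain Python, not PySpark, and it does not reflect how Spark implements the operation. It only models the rule that two rows count as duplicates when every column matches. The helper name distinct_rows and the extra sixth row are assumptions introduced for this illustration.

```python
# Plain-Python illustration of full-row de-duplication (not PySpark itself).
# distinct_rows is a hypothetical helper introduced only for this sketch.

def distinct_rows(rows):
    """Return the rows with exact duplicates removed, keeping first-seen order."""
    # Lists are unhashable, so each row is converted to a tuple; dict keys
    # are unique and preserve insertion order.
    unique = dict.fromkeys(tuple(row) for row in rows)
    return [list(row) for row in unique]


# Same college data as Example 1, plus one extra row (an assumption for
# this sketch) that shares ID and NAME with an existing row but differs
# in COLLEGE.
data = [["1", "bobby", "vvit"],
        ["2", "sravan", "jntuk"],
        ["3", "rohith", "AU"],
        ["4", "sridevi", "GVRS"],
        ["1", "bobby", "vvit"],   # exact duplicate of the first row
        ["1", "bobby", "AU"]]     # differs in one column, so it is kept

result = distinct_rows(data)
print(len(data), "rows before,", len(result), "rows after")
for row in result:
    print(row)
```

Only the exact copy of (1, bobby, vvit) is dropped; the row that differs in COLLEGE survives. Unlike this sketch, which keeps rows in first-seen order, a Spark dataframe returned by distinct() carries no ordering guarantee.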