How to take a random row from a PySpark DataFrame?
Last Updated : 30 Jan, 2022
In this article, we are going to learn how to take a random row from a PySpark DataFrame in the Python programming language.
Method 1: PySpark sample() method
PySpark provides various methods for Sampling which are used to return a sample from the given PySpark DataFrame.
Here are the details of the sample() method :
Syntax : DataFrame.sample(withReplacement, fraction, seed)
It returns a subset of the DataFrame.
Parameters :
withReplacement : bool, optional
Sample with replacement or not (default False).
fraction : float, optional
Fraction of rows to generate, in the range [0.0, 1.0].
seed : int, optional
Used to reproduce the same random sampling.
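For instance, a minimal sketch (assuming df is an already existing DataFrame; the variable names are only illustrative) that draws a reproducible 50% sample without replacement could look like this:
Python
# df is assumed to be an existing PySpark DataFrame.
# Keep roughly half the rows, without replacement;
# fixing the seed makes the same sample come back on every run.
half_sample = df.sample(withReplacement=False, fraction=0.5, seed=42)
half_sample.show()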
Example:
In this example, we pass a fraction (a float in the range [0.0, 1.0]) to sample(). Using the formula :
Number of rows needed = Fraction * Total number of rows
the fraction we need for a single row is 1 / (total number of rows).
Python
# importing the library and
# its SparkSession functionality
import pyspark
from pyspark.sql import SparkSession

# creating a session to make DataFrames
random_row_session = SparkSession.builder.appName(
    'Random_Row_Session'
).getOrCreate()

# Pre-set data for our DataFrame
data = [['a', 1], ['b', 2], ['c', 3], ['d', 4]]
columns = ['Letters', 'Position']

# Creating a DataFrame
df = random_row_session.createDataFrame(data, columns)

# Printing the DataFrame
df.show()

# Taking a sample of df and storing it in df2.
# Note that the second argument is the fraction (a float)
# of the dataset we need:
# number of rows = fraction * total number of rows
df2 = df.sample(False, 1.0 / df.count())

# Printing the sampled row, which is itself a DataFrame
df2.show()
Output :
+-------+--------+
|Letters|Position|
+-------+--------+
| a| 1|
| b| 2|
| c| 3|
| d| 4|
+-------+--------+
+-------+--------+
|Letters|Position|
+-------+--------+
| b| 2|
+-------+--------+
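Note that sample() makes an independent keep/drop decision for every row, so with fraction = 1/N the result can contain zero rows or more than one. If exactly one random row is required, a common alternative (not part of the example above, just a sketch reusing the same df) is to order by a random value and take the first row:
Python
from pyspark.sql.functions import rand

# Shuffle the rows into a random order and keep exactly one;
# df is the DataFrame created in the example above.
one_row = df.orderBy(rand()).limit(1)
one_row.show()
This shuffles the whole DataFrame, so it is best suited to small or medium-sized data.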
Method 2: Using takeSample() method
We first convert the PySpark DataFrame to an RDD. Resilient Distributed Dataset (RDD) is the simplest and most fundamental data structure in PySpark: an immutable, distributed collection of elements of any type.
We can get the RDD of a DataFrame using DataFrame.rdd and then call the takeSample() method on it.
Syntax of takeSample() :
takeSample(withReplacement, num, seed=None)
Parameters :
withReplacement : bool, optional
Sample with replacement or not (default False).
num : int
The number of sample values to return.
seed : int, optional
Used to reproduce the same random sampling.
Returns : a list of num randomly sampled elements (Row objects) from the RDD.
Example: In this example, we use the takeSample() method on the RDD with the parameter num = 1 to get a single Row object; num is the number of samples to take.
Python
# importing the library and
# its SparkSession functionality
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql import Row

# creating a session to make DataFrames
random_row_session = SparkSession.builder.appName(
    'Random_Row_Session'
).getOrCreate()

# Pre-set data for our DataFrame
data = [['a', 1], ['b', 2], ['c', 3], ['d', 4]]
columns = ['Letters', 'Position']

# Creating a DataFrame
df = random_row_session.createDataFrame(data, columns)

# Printing the DataFrame
df.show()

# Getting the RDD object from the DataFrame
rdd = df.rdd

# Taking a single sample from the RDD
# by passing num = 1 to takeSample()
rdd_sample = rdd.takeSample(withReplacement=False, num=1)
print(rdd_sample)
Output :
+-------+--------+
|Letters|Position|
+-------+--------+
| a| 1|
| b| 2|
| c| 3|
| d| 4|
+-------+--------+
[Row(Letters='c', Position=3)]
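Because takeSample() returns a plain Python list of Row objects rather than a DataFrame, the sampled row can be turned back into a single-row DataFrame if needed. A small sketch reusing random_row_session and rdd_sample from the example above:
Python
# rdd_sample is the list of Row objects returned by takeSample() above
df_single = random_row_session.createDataFrame(rdd_sample)
df_single.show()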
Method 3: Convert the PySpark DataFrame to a Pandas DataFrame and use the sample() method
We can use the toPandas() function to convert a PySpark DataFrame to a Pandas DataFrame. This method should only be used if the resulting Pandas DataFrame is expected to be small, as all the data is loaded into the driver's memory. This is an experimental method.
We will then use the sample() method of the Pandas library. It returns a random sample from an axis of the Pandas DataFrame.
Syntax : PandasDataFrame.sample(n=None, frac=None, replace=False, weights=None, random_state=None, axis=None, ignore_index=False)
Example:
In this example, we will be converting our PySpark DataFrame to a Pandas DataFrame and using the Pandas sample() function on it.
Python
# importing the library and
# its SparkSession functionality
import pyspark
from pyspark.sql import SparkSession

# creating a session to make DataFrames
random_row_session = SparkSession.builder.appName(
    'Random_Row_Session'
).getOrCreate()

# Pre-set data for our DataFrame
data = [['a', 1], ['b', 2], ['c', 3], ['d', 4]]
columns = ['Letters', 'Position']

# Creating a DataFrame
df = random_row_session.createDataFrame(data, columns)

# Printing the DataFrame
df.show()

# Converting the DataFrame to a Pandas DataFrame
# and taking a single sample row
pandas_random = df.toPandas().sample()

# Converting the sample back into a PySpark DataFrame
df_random = random_row_session.createDataFrame(pandas_random)

# Showing our randomly selected row
df_random.show()
Output :
+-------+--------+
|Letters|Position|
+-------+--------+
| a| 1|
| b| 2|
| c| 3|
| d| 4|
+-------+--------+
+-------+--------+
|Letters|Position|
+-------+--------+
| b| 2|
+-------+--------+
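As with the other methods, the Pandas route can also be made reproducible. A brief sketch (reusing df from the example above) that passes n and random_state to Pandas' sample():
Python
# Take exactly one row, reproducibly, through Pandas;
# df is the PySpark DataFrame from the example above.
pandas_row = df.toPandas().sample(n=1, random_state=42)
print(pandas_row)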