Python PySpark pivot() Function
The pivot() function in PySpark is a powerful method for reshaping a DataFrame: it turns the unique values of one column into separate columns in a new DataFrame, aggregating the remaining data in the process.
In this article, we will go through a detailed example of how to use the pivot() function in PySpark, covering its usage step by step.
Introduction to PySpark pivot()
In PySpark, the pivot() function is part of the DataFrame API and is called on the grouped data returned by groupBy(). It allows us to convert rows into columns by specifying:
- A column whose values will become new columns.
- Optionally, an aggregation method to apply during the pivot, e.g., sum(), avg(), etc.
The syntax of the pivot() function is:
df.groupBy(grouping_columns).pivot(pivot_column, [values])
Where:
- grouping_columns: The column(s) whose values remain as rows and identify each group.
- pivot_column: The column whose unique values will become the column headers.
- values: An optional list of values from pivot_column to include as columns. If not specified, all unique values will be used.
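For example, with the sales DataFrame built in the next section, a pivot that explicitly lists the region values to keep could look like the following sketch. Passing the values list is optional, but it spares Spark the extra pass over the data that is otherwise needed to discover the distinct values of the pivot column.
Python
# Sketch (assumes the df with Product, Region and Sales columns created below).
# pivot() is called on the grouped data returned by groupBy(), and the optional
# second argument restricts the pivot to the listed values.
pivot_df = df.groupBy("Product").pivot("Region", ["East", "West"]).sum("Sales")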
Example Data:
Let's consider the following DataFrame, which contains sales data of different products in various regions:
| Product | Region | Sales |
| --- | --- | --- |
| A | East | 100 |
| A | West | 150 |
| B | East | 200 |
| B | West | 250 |
| C | East | 300 |
| C | West | 350 |
We will pivot this data so that each region becomes a column, with sales as the values.
Step-by-Step Implementation
1. Initial DataFrame Setup
First, let's create a PySpark DataFrame for the sales data:
Python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
# Create a SparkSession
spark = SparkSession.builder.appName("PySpark Pivot Example").getOrCreate()
# Sample data
data = [
    ("A", "East", 100),
    ("A", "West", 150),
    ("B", "East", 200),
    ("B", "West", 250),
    ("C", "East", 300),
    ("C", "West", 350)
]
# Create a DataFrame
columns = ["Product", "Region", "Sales"]
df = spark.createDataFrame(data, columns)
# Show the initial DataFrame
df.show()
Output:
[Image: Creating a PySpark DataFrame]
2. Using pivot()
Now we will use the pivot() function to reorganize the data. We want to pivot on the Region column, so that East and West become separate columns. We will aggregate the sales by Product.
Python
# ...
# Pivot the DataFrame
pivot_df = df.groupBy("Product").pivot("Region").sum("Sales")
# Show the pivoted DataFrame
pivot_df.show()
Output:
[Image: Using PySpark Pivot Function]
In this output:
- Each Product is in a single row.
- The East and West regions have become columns.
- The values in the new columns represent the Sales values.
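The sum("Sales") call above is a shortcut; the same pivot can be written with agg() and an explicit aggregate expression from pyspark.sql.functions, which also lets us rename the result. A minimal sketch on the same DataFrame:
Python
from pyspark.sql import functions as F

# Same pivot, expressed with an explicit aggregate expression.
# agg() makes it easy to alias the result or swap in another aggregate later.
pivot_df_agg = df.groupBy("Product").pivot("Region").agg(F.sum("Sales"))
pivot_df_agg.show()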
3. Aggregating with pivot()
We can also apply additional aggregation functions during the pivot process. For example, if we had multiple rows for each Product and Region, we could use avg(), min(), max(), or other aggregate functions.
For example, if we have multiple sales entries for each product in the same region, we can sum them during the pivot.
Python
# ...
# Sample data with multiple entries for the same product-region combination
data_with_duplicates = [
    ("A", "East", 100),
    ("A", "East", 50),
    ("A", "West", 150),
    ("B", "East", 200),
    ("B", "West", 250),
    ("B", "West", 100),
    ("C", "East", 300),
    ("C", "West", 350)
]
# Create a new DataFrame with duplicate data
df_with_duplicates = spark.createDataFrame(data_with_duplicates, columns)
# Pivot and sum the sales
pivot_df_with_aggregation_sum = df_with_duplicates.groupBy("Product").pivot("Region").sum("Sales")
# Show the result
pivot_df_with_aggregation_sum.show()
# Pivot and average the sales
pivot_df_with_aggregation_avg = df_with_duplicates.groupBy("Product").pivot("Region").avg("Sales")
# Show the result
pivot_df_with_aggregation_avg.show()
Output:
[Image: Aggregating with Pivot in PySpark]
Here, in the summed output, the two East entries for Product A have been added together (100 + 50 = 150), and the two West entries for Product B have been summed as well (250 + 100 = 350).
In the averaged output, Product A's East entries become (100 + 50) / 2 = 75.0, and Product B's West entries become (250 + 100) / 2 = 175.0.
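If both the total and the average are needed, agg() also accepts several aggregate expressions in a single pivot; in that case Spark names each resulting column by combining the pivot value with the expression's alias (for example East_total and East_avg). A hedged sketch using the duplicated data above:
Python
from pyspark.sql import functions as F

# One pivot producing both a sum and an average column per region,
# e.g. East_total, East_avg, West_total, West_avg.
pivot_df_multi = (
    df_with_duplicates
    .groupBy("Product")
    .pivot("Region")
    .agg(F.sum("Sales").alias("total"), F.avg("Sales").alias("avg"))
)
pivot_df_multi.show()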
Conclusion
The pivot() function in PySpark is a powerful tool for transforming data. It allows us to convert row-based data into column-based data by pivoting on a specific column's values. In this article, we demonstrated how to pivot data using PySpark, with a focus on sales data by region. Additionally, we showed how to apply aggregation methods like sum() during the pivot process. By utilizing pivot(), we can restructure our DataFrame to make it more suitable for further analysis or reporting.
When using pivot(), keep in mind:
- We should choose the correct column for pivoting based on the data structure.
- Ensure that the aggregation method used suits our needs, whether it be sum(), avg(), or another method.
- Always call groupBy() before pivot(), since pivot() operates on grouped data rather than directly on a DataFrame.