PySpark Notes

The document discusses the evolution and significance of Big Data and Apache Spark, detailing the challenges of managing large volumes, varieties, velocities, and veracities of data. It highlights Spark's advantages over Hadoop, including speed, ease of use, and flexibility, and explains key concepts such as RDDs, DataFrames, and Datasets. Additionally, it provides practical examples of data manipulation using PySpark, including filtering, aggregating, joining, and handling missing data.

The Birth of Big Data (The Beginning)

In Digitalia, every action of its citizens—tweets, purchases, selfies—adds to the kingdom’s treasure trove of information. But managing this treasure is no small feat. Advisors break down Big Data:

● Volume:
The size of data is staggering—petabytes, even zettabytes. Traditional databases
collapse under such weight. Imagine tracking every grain of sand on Earth; that’s the
challenge of volume.
● Variety:
Data doesn’t come neatly packaged. There’s structured data (like tables), semi-
structured data (like JSON or XML), and unstructured data (like videos or audio). A
single event, such as booking a flight, generates multiple types of data.
● Velocity:
Data moves at blinding speed, often in real-time. A streaming video platform like Netflix
gathers user data every second, which must be processed immediately to provide
recommendations or maintain quality.
● Veracity:
Not all data is reliable. Noise, inconsistencies, or outright inaccuracies can mislead
decision-makers. For example, social media data may contain spam or fake reviews.

The Rise of Distributed Computing (The Strategy)


The sage Distributed Computing introduces the kingdom to a game-changing strategy: Divide
and Conquer. Instead of one computer trying to process everything, many work together. Key
principles include:

● Parallelism: Multiple tasks run simultaneously.


● Fault Tolerance: If one machine fails, others pick up the slack.
● Scalability: The system grows by adding more machines.

For instance, imagine slicing a massive cake (data) and sharing it among hundreds of bakers
(machines). Together, they process the cake faster than one baker could. Frameworks like
Hadoop’s HDFS (Hadoop Distributed File System) lay the foundation for this approach.

Enter Apache Spark (The Hero)


Apache Spark enters as the revolutionary warrior in Big Data's saga, armed with unique
strengths:

● In-Memory Computing: Unlike older tools that write intermediate data to disk, Spark
keeps data in memory. This boosts speed dramatically, particularly for iterative tasks like
machine learning.
● Unified Platform: Spark handles batch processing (like Hadoop), real-time streaming,
machine learning, and graph analytics—all in one tool.
● Polyglot: Spark speaks multiple programming languages: Python (PySpark), Java,
Scala, and R, making it accessible to diverse developers.

The Hero's Journey (History and Evolution)


Spark’s story begins in 2009 at UC Berkeley’s AMPLab. Dissatisfied with Hadoop's slow
performance for iterative algorithms, researchers created Spark to handle both speed and
variety. Key milestones:

● 2010: Spark is open-sourced; in 2013 the project is donated to the Apache Software Foundation.
● 2014: Spark 1.0 is released, with widespread adoption in industries from banking to e-commerce.
● 2016: Spark 2.0 introduces Structured Streaming, enabling developers to process real-time data more effectively.

Today, Spark powers everything from ride-sharing platforms (Uber) to streaming services
(Netflix).

Features and Use Cases (The Hero's Powers)


Spark boasts powerful abilities:

1. Speed: Handles petabytes of data at unmatched speeds.


2. Ease of Use: With PySpark, even beginners can process data using Python.
3. Flexibility: Works seamlessly with data stored in Hadoop, cloud systems, or local
machines.

Use Cases:

● Machine Learning: Train models on massive datasets in record time.


● Real-Time Analytics: Monitor stock prices or detect fraud as it happens.
● ETL (Extract, Transform, Load): Process and clean data efficiently for storage and
analysis.

Spark vs. Hadoop (The Rivalry)


Although Spark and Hadoop serve the same kingdom, they approach problems differently:

● Processing Speed: Spark’s in-memory processing is 10-100x faster than Hadoop’s disk-based MapReduce.
● Ease of Use: Spark offers APIs in high-level languages like Python, whereas Hadoop
relies on Java.
● Flexibility: Spark supports real-time streaming, while Hadoop focuses on batch
processing.

Still, Spark can run on Hadoop’s distributed storage (HDFS), making them allies when needed.

What does PySpark do?

Simply put, it executes operations on distributed data, so the operations themselves also need to be distributed. Some operations are simple, such as filtering out all items that don't respect some rule. Others are more complex, such as groupBy, which needs to move data around, and join, which needs to associate items from two or more datasets.

Another important fact is that input and output are stored in different formats; Spark has connectors to read and write those, but that means serializing and deserializing them. While transparent, serialization is often the most expensive operation.

Finally, Spark tries to keep data in memory for processing, but it will serialize and deserialize data on each worker locally when it doesn't fit in memory. Once again, this is done transparently, but it can be costly.
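
As a minimal sketch of these ideas (the paths, column names, and app name below are made up for illustration), the pipeline mixes a per-partition operation (filter) with shuffle-heavy ones (groupBy, join), and serialization happens at the read and write boundaries:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count

spark = SparkSession.builder.appName("DistributedOpsSketch").getOrCreate()

# Reading deserializes the stored format into Spark's internal representation
orders = spark.read.parquet("/data/orders")        # hypothetical path
customers = spark.read.parquet("/data/customers")  # hypothetical path

# Filtering is simple: each partition is processed independently, with no data movement
recent = orders.filter(col("year") >= 2023)

# groupBy and join move rows with the same key to the same partition (a shuffle)
per_customer = recent.groupBy("customer_id").agg(count("*").alias("order_count"))
enriched = per_customer.join(customers, on="customer_id", how="left")

# Writing serializes the result back into the output format
enriched.write.mode("overwrite").parquet("/data/order_counts")  # hypothetical path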

Difference between RDD, DataFrame, and Dataset:

RDD
It's the first API provided by Spark. To put it simply, it is an unordered sequence of Scala/Java objects distributed over a cluster. All operations executed on it are JVM methods (passed to map, flatMap, groupBy, ...) that need to be serialized, sent to all workers, and applied to the JVM objects there. This is pretty much the same as using a Scala Seq, but distributed. It is strongly typed, meaning that "if it compiles then it works" (if you don't cheat). However, lots of distribution issues can arise, especially if Spark doesn't know how to serialize or deserialize the JVM classes and methods.

Dataframe
It came after and is semantically very different from RDD. The data are considered as tables, and operations such as SQL operations can be applied to them. It is not typed at all, so errors can arise at any time during execution. However, there are two main pros: (1) many people are used to the table/SQL semantics and operations, and (2) Spark doesn't need to deserialize a whole row to process one of its columns, provided the data format offers suitable column access. And many do, such as the Parquet file format, which is the most commonly used.

Dataset
It is an improvement of Dataframe that brings some type safety. Datasets are Dataframes to which we associate an "encoder" related to a JVM class, so Spark can check that the data schema is correct before executing the code. Note, however, that although you can sometimes read that Datasets are strongly typed, they are not: they bring some type safety in that you cannot compile code that uses a Dataset with a type other than the one that has been declared. But it is very easy to write code that compiles yet still fails at runtime, because many Dataset operations lose the type (pretty much everything apart from filter). Still, it is a huge improvement, because even when we make a mistake, it fails fast: the failure happens when interpreting the Spark DAG (i.e. at start) instead of during data processing.

Pros and cons

● Dataset:
    ● pros: has optimized operations over column-oriented storages
    ● pros: many operations don't need deserialization
    ● pros: provides table/SQL semantics if you like them (I don't ;)
    ● pros: Dataset operations come with an optimization engine, "Catalyst", that improves the performance of your code. I'm not sure, however, if it is really that great. If you know what you code, i.e. what is done to the data, your code should be optimized by itself.
    ● cons: most operations lose typing
    ● cons: Dataset operations can become too complicated for complex algorithms that don't suit them. The two main limits I know are managing invalid data and complex math algorithms.
● Dataframe:
    ● pros: required between Dataset operations that lose the type
    ● cons: just use Dataset; it has all the advantages and more
● RDD:
    ● pros: (really) strongly typed
    ● pros: Scala/Java semantics. You can design your code pretty much how you would for a single-JVM app that processes in-memory collections. Well, with functional semantics :)
    ● cons: full JVM deserialization is required to process data at any of the steps mentioned before: after reading input, and between all processing steps that require data to be moved between workers or stored locally to manage memory bounds.
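
Note that the typed Dataset API exists only in Scala and Java; in PySpark you choose in practice between RDDs and DataFrames. A minimal, illustrative sketch of the two styles on the same toy data (the department/salary values are made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RDDvsDataFrame").getOrCreate()
sc = spark.sparkContext

# RDD: an unordered, distributed collection of plain objects;
# transformations are arbitrary functions shipped to the workers
rdd = sc.parallelize([("HR", 50000), ("IT", 80000), ("IT", 85000)])
totals_rdd = rdd.reduceByKey(lambda a, b: a + b)
print(totals_rdd.collect())

# DataFrame: the same data seen as a table with named columns;
# operations are declarative and go through the Catalyst optimizer
df_salaries = spark.createDataFrame([("HR", 50000), ("IT", 80000), ("IT", 85000)],
                                    ["department", "salary"])
df_salaries.groupBy("department").sum("salary").show()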

Import data types


Many PySpark operations require that you use SQL functions or interact with native Spark
types. You can either directly import only those functions and types that you need, or you can
import the entire module.

# import all
from pyspark.sql.types import *
from pyspark.sql.functions import *

# import select functions and types


from pyspark.sql.types import IntegerType, StringType
from pyspark.sql.functions import floor, round

Because some imported functions might override Python built-in functions, some users choose
to import these modules using an alias. The following examples show a common alias used in
Apache Spark code examples:

import pyspark.sql.types as T
import pyspark.sql.functions as F
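
For instance, a short illustrative snippet using these aliases (it assumes a DataFrame df with a numeric salary column, like the one used in the examples later in these notes):

import pyspark.sql.functions as F
import pyspark.sql.types as T

# F.round is Spark's column function; Python's built-in round stays untouched
df_rounded = df.withColumn("salary_rounded", F.round(F.col("salary") * 1.05, 2))

# Types are referenced through the alias as well, e.g. when casting
df_typed = df_rounded.withColumn("salary_int", F.col("salary").cast(T.IntegerType()))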

Create a DataFrame
There are several ways to create a DataFrame. Usually you define a DataFrame against a data
source such as a table or collection of files. Then as described in the Apache Spark
fundamental concepts section, use an action, such as display, to trigger the transformations to
execute. The display method outputs DataFrames.
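
For example, a sketch of defining a DataFrame against a file source and then triggering execution with an action (the path assumes the employees.csv file shown later in these notes; show() is used here because display() is only available in notebook environments that provide it):

# Define the DataFrame against a file-based data source
df = spark.read.csv("employees.csv", header=True, inferSchema=True)

# An action triggers the transformations and outputs rows
df.show()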

Create a DataFrame with specified values


To create a DataFrame with specified values, use the createDataFrame method, where rows
are expressed as a list of tuples:

df_children = spark.createDataFrame(
data = [("Mikhail", 15), ("Zaky", 13), ("Zoya", 8)],
schema = ['name', 'age'])
display(df_children)

Notice in the output that the data types of the columns of df_children are automatically inferred. You can alternatively specify the types by adding a schema. Schemas are defined using a StructType, which is made up of StructFields that specify each column's name, data type, and a boolean flag indicating whether it can contain null values. You must import data types from pyspark.sql.types.

from pyspark.sql.types import StructType, StructField, StringType, IntegerType


df_children_with_schema = spark.createDataFrame(
data = [("Mikhail", 15), ("Zaky", 13), ("Zoya", 8)],
schema = StructType([
StructField('name', StringType(), True),
StructField('age', IntegerType(), True)
])
)
display(df_children_with_schema)

1. Selecting, Filtering, and Sorting Data


Scenario:
You want to see employees older than 30 working in the "Engineering" department,
sorted by salary in descending order.

# Filter, Select, and Sort Example


filtered_df = df.filter((col("age") > 30) & (col("department") ==
"Engineering")) \
.select("id", "name", "age", "salary", "department") \
.orderBy(col("salary").desc())

filtered_df.show()

Explanation:

● filter() filters rows based on the condition (age > 30 and department ==
"Engineering").
● select() chooses specific columns to display.
● orderBy() sorts the results by salary in descending order.
● col() references a column to keep the code cleaner and avoid hardcoding
column names.

2. Aggregations and groupBy Operations


Scenario:
You want to calculate the average salary and total bonus per department.

# GroupBy and Aggregation Example


aggregated_df = df.groupBy("department") \
.agg(avg("salary").alias("avg_salary"),
sum("bonus").alias("total_bonus"))

aggregated_df.show()

Explanation:

● groupBy("department") groups the data by the "department" column.


● agg() applies aggregate functions like avg (average) and sum (sum).
● alias() renames the aggregated columns for readability.

3. Joining DataFrames
Scenario:
Assume you have another dataset with department managers. Join this data with the
original dataset to include manager names.

# Create a second DataFrame (department managers)


managers_data = [( "HR", "John"), ("Finance", "Mary"), ("Engineering",
"Steve"), ("Marketing", "Kate")]
managers_columns = ["department", "manager_name"]
managers_df = spark.createDataFrame(managers_data, managers_columns)

# Join with the original DataFrame


joined_df = df.join(managers_df, on="department", how="left")

joined_df.show()

Explanation:

● A new DataFrame managers_df is created containing department managers.


● join() combines df and managers_df using the "department" column.
● how="left" ensures all rows from the left DataFrame (df) are included.

4. Handling Missing Data


Scenario:
You want to replace missing bonus values with 0.

# Handling Missing Data


cleaned_df = df.fillna({"bonus": 0})
cleaned_df.show()
Explanation:

● fillna() replaces null (missing) values with the specified value (e.g., 0 for
"bonus").

5. Built-in Functions (col, lit, when, etc.)


Scenario:
You want to add a new column salary_category based on the salary values.

# Add a Derived Column using `when` and `col`


updated_df = df.withColumn("salary_category",
when(col("salary") > 100000, "High")
.when(col("salary").between(60000, 100000), "Medium")
.otherwise("Low"))

updated_df.select("id", "name", "salary", "salary_category").show()

Explanation:

● withColumn() adds a new column to the DataFrame.


● when() and otherwise() apply conditional logic to categorize salaries.
● col() references the "salary" column.

6. User-Defined Functions (UDFs)


Scenario:
You want to mask employee names for data privacy by only showing the first letter.

from pyspark.sql.functions import udf


from pyspark.sql.types import StringType

# Define a UDF to mask names


def mask_name(name):
    return name[0] + "*" * (len(name) - 1)

mask_name_udf = udf(mask_name, StringType())

# Apply the UDF to create a new column


masked_df = df.withColumn("masked_name", mask_name_udf(col("name")))
masked_df.select("id", "name", "masked_name").show()

Explanation:

● A Python function mask_name() is defined to mask the name.


● udf() registers the function as a User-Defined Function with Spark.
● withColumn() applies the UDF to create a new column.

Example 1: Calculate Years Until Retirement


Scenario:
You want to calculate how many years each employee has until retirement, assuming the
retirement age is 65.

from pyspark.sql.types import IntegerType

# Define a UDF to calculate years until retirement

def years_to_retirement(age):
    return max(65 - age, 0)

# Register the UDF

retirement_udf = udf(years_to_retirement, IntegerType())


# Apply the UDF to create a new column

retirement_df = df.withColumn("years_to_retirement",
retirement_udf(col("age")))

# Display the result

retirement_df.select("id", "name", "age", "years_to_retirement").show()

Explanation:

● years_to_retirement() is a Python function that calculates the difference between the retirement age (65) and the employee's current age.
● udf() registers the function with Spark and specifies the return type as
IntegerType.
● withColumn() creates a new column, years_to_retirement, by applying the
UDF.

Example 2: Categorize Cities by Region


Scenario:
You want to add a column categorizing cities into regions (e.g., "East", "West",
"Central").

# Define a UDF to categorize cities by region

def city_to_region(city):
    if city in ["New York", "Houston"]:
        return "East"
    elif city in ["Los Angeles", "San Francisco"]:
        return "West"
    elif city == "Chicago":
        return "Central"
    else:
        return "Unknown"

# Register the UDF

region_udf = udf(city_to_region, StringType())

# Apply the UDF to create a new column

region_df = df.withColumn("region", region_udf(col("city")))

# Display the result

region_df.select("id", "city", "region").show()

Explanation:

● city_to_region() is a Python function that maps each city to a specific region.


● udf() registers the function with Spark and specifies the return type as
StringType.
● withColumn() adds a new column, region, to the DataFrame.

7. Window Functions
Scenario:
You want to rank employees within each department by their salary.

from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

# Define a window partitioned by department and ordered by salary
window_spec = Window.partitionBy("department").orderBy(col("salary").desc())

# Add a rank column
ranked_df = df.withColumn("rank", row_number().over(window_spec))
ranked_df.select("id", "name", "department", "salary", "rank").show()

Explanation:

● Window.partitionBy() defines the window (grouping by department).


● orderBy() sorts salaries within each partition in descending order.
● row_number() assigns a rank based on the order.

Here are examples of various Window Functions using the same dataset:

1. Ranking Employees by Salary within Departments


Scenario: Rank employees within each department based on their salary in descending
order.

from pyspark.sql.window import Window


from pyspark.sql.functions import row_number

# Define a Window specification


window_spec_rank = Window.partitionBy("department").orderBy(col("salary").desc())

# Add a rank column


ranked_df = df.withColumn("rank", row_number().over(window_spec_rank))

# Display the results


ranked_df.select("id", "name", "department", "salary", "rank").show()

Explanation:

● partitionBy("department"): Groups employees by department.


● orderBy(col("salary").desc()): Orders employees within each department
by salary in descending order.
● row_number(): Assigns a unique rank to each employee within the partition.
2. Calculate the Running Total of Salaries within Departments
Scenario: Calculate a cumulative (running) sum of salaries for each department.

from pyspark.sql.functions import sum

# Define a Window specification


window_spec_cumsum = Window.partitionBy("department").orderBy("salary")

# Add a cumulative sum column


cumsum_df = df.withColumn("running_total_salary",
sum("salary").over(window_spec_cumsum))

# Display the results


cumsum_df.select("id", "name", "department", "salary",
"running_total_salary").show()

Explanation:

● sum("salary").over(window_spec_cumsum): Computes a cumulative sum of


salaries ordered by salary within each department.

3. Average Salary within Departments


Scenario: Calculate the average salary for each department and include it as a column
for all employees.

from pyspark.sql.functions import avg

# Define a Window specification


window_spec_avg = Window.partitionBy("department")

# Add an average salary column


avg_salary_df = df.withColumn("avg_salary",
avg("salary").over(window_spec_avg))
# Display the results
avg_salary_df.select("id", "name", "department", "salary",
"avg_salary").show()

Explanation:

● avg("salary").over(window_spec_avg): Computes the average salary for


each department.
● No ordering is needed since we are calculating an aggregate for the entire
partition.

4. Difference Between Current Salary and Average Salary


Scenario: Calculate the difference between each employee's salary and the average
salary in their department.

# Add a column for salary difference from the average


salary_diff_df = avg_salary_df.withColumn("salary_diff", col("salary")
- col("avg_salary"))

# Display the results


salary_diff_df.select("id", "name", "department", "salary",
"avg_salary", "salary_diff").show()

Explanation:

● col("salary") - col("avg_salary"): Computes the difference between the


current salary and the average salary for the department.

5. Assign Dense Rank Based on Bonus within Departments


Scenario: Assign a dense rank to employees based on their bonuses within each
department.
from pyspark.sql.functions import dense_rank

# Define a Window specification


window_spec_dense_rank = Window.partitionBy("department").orderBy(col("bonus").desc())

# Add a dense rank column


dense_rank_df = df.withColumn("dense_rank",
dense_rank().over(window_spec_dense_rank))

# Display the results


dense_rank_df.select("id", "name", "department", "bonus",
"dense_rank").show()

Explanation:

● dense_rank(): Assigns the same rank to tied values and leaves no gaps in the ranking sequence (unlike rank(), which skips positions after ties).
● orderBy(col("bonus").desc()): Orders employees by bonuses within each department.

6. Lead and Lag Functions


Scenario: Find the previous and next salary for each employee within their department
based on salary order.

from pyspark.sql.functions import lead, lag

# Define a Window specification


window_spec_lead_lag = Window.partitionBy("department").orderBy("salary")

# Add previous and next salary columns


lead_lag_df = df.withColumn("prev_salary",
lag("salary").over(window_spec_lead_lag)) \
.withColumn("next_salary",
lead("salary").over(window_spec_lead_lag))
# Display the results
lead_lag_df.select("id", "name", "department", "salary",
"prev_salary", "next_salary").show()

Explanation:

● lag("salary"): Retrieves the salary of the previous employee within the


department.
● lead("salary"): Retrieves the salary of the next employee within the
department.

7. Row Number Without a Partition


Scenario: Number employees across all departments without grouping them by
department.

# Define a Window specification without partitioning


window_spec_row_number = Window.orderBy("salary")

# Add a row number column


row_number_df = df.withColumn("row_number",
row_number().over(window_spec_row_number))

# Display the results


row_number_df.select("id", "name", "department", "salary",
"row_number").show()

Explanation:

● Window.orderBy("salary"): Orders the entire DataFrame by salary, ignoring partitions.
● row_number(): Assigns a unique row number to each employee.

8. Percent Rank
Scenario: Calculate the percentile rank of each employee's salary within their
department.
from pyspark.sql.functions import percent_rank

# Define a Window specification


window_spec_percent_rank = Window.partitionBy("department").orderBy("salary")

# Add a percent rank column


percent_rank_df = df.withColumn("percent_rank",
percent_rank().over(window_spec_percent_rank))

# Display the results


percent_rank_df.select("id", "name", "department", "salary",
"percent_rank").show()

Explanation:

● percent_rank(): Computes the percentile rank of each employee within their department based on their salary.

5. Advanced topics in PySpark

1. Writing SQL Queries on DataFrames


Scenario: You want to find the top 3 employees with the highest salaries.

# Register the DataFrame as a temporary SQL view


df.createOrReplaceTempView("employees")

# Write SQL query to fetch the top 3 employees with the highest salaries
top_3_employees = spark.sql("""
SELECT id, name, department, salary
FROM employees
ORDER BY salary DESC
LIMIT 3
""")

# Display the result


top_3_employees.show()

Explanation:

● createOrReplaceTempView("employees"): Registers the DataFrame as a temporary SQL view named employees.
● spark.sql(): Executes a SQL query to retrieve data from the temporary view.
● ORDER BY salary DESC: Sorts employees by salary in descending order.
● LIMIT 3: Restricts the output to the top 3 rows.

2. Registering DataFrames as Temporary Views


Scenario: You want to analyze employees by department using SQL.

# Register the DataFrame as a global temporary view


df.createGlobalTempView("global_employees")

# Query the global temporary view


department_analysis = spark.sql("""
SELECT department, COUNT(*) AS employee_count, AVG(salary) AS
avg_salary
FROM global_temp.global_employees
GROUP BY department
""")

# Display the result


department_analysis.show()

Explanation:

● createGlobalTempView("global_employees"): Registers the DataFrame as a global temporary view that is accessible across sessions.
● GROUP BY department: Groups data by the department column.
● COUNT(*) and AVG(salary): Calculate the number of employees and average
salary in each department.

3. Query Optimization with Catalyst Optimizer


Scenario: Observe query optimization when filtering data.

# Filter employees with a salary above 70,000 using DataFrame API


filtered_df = df.filter(col("salary") > 70000)

# Display the physical plan (Catalyst Optimizer optimizations)


filtered_df.explain(True)
Explanation:

● filter(col("salary") > 70000): Filters employees with a salary above


70,000.
● explain(True): Shows the physical execution plan, highlighting Catalyst
Optimizer's optimizations (e.g., predicate pushdown, projection pruning).

4. Working with Nested Structures (Arrays)


Scenario: Add a column containing an array of employee's city and department, then
extract these elements.

from pyspark.sql.functions import array, col

# Create a new column with an array of city and department


nested_df = df.withColumn("city_department", array("city",
"department"))

# Extract elements from the array


nested_df.select("id", "name", "city_department",
    col("city_department")[0].alias("city"),
    col("city_department")[1].alias("department")).show()

Explanation:

● array("city", "department"): Creates an array combining city and


department.
● col("city_department")[0]: Extracts the first element (city) from the array.
● col("city_department")[1]: Extracts the second element (department) from
the array.

5. Exploding and Collecting Data


Scenario: Explode a column containing skills (an array) and then collect all skills back
into an array.
from pyspark.sql.functions import explode, collect_list

# Create a new column with an array of skills


skills_df = df.withColumn("skills", array(lit("Python"), lit("Spark"),
lit("SQL")))

# Explode the skills array into individual rows


exploded_df = skills_df.select("id", "name",
explode(col("skills")).alias("skill"))

# Collect skills back into an array grouped by name


collected_df = exploded_df.groupBy("name").agg(collect_list("skill").alias("all_skills"))

# Display the results


collected_df.show()

Explanation:

1. Exploding Data:
○ explode(col("skills")): Breaks an array column (skills) into multiple
rows, one per element.
○ Each skill for every employee appears in its own row.
2. Collecting Data:
○ collect_list("skill"): Aggregates all skill values into an array for
each employee.
○ Results in a grouped structure with arrays.

Summary of Key Concepts


● Writing SQL Queries on DataFrames: createOrReplaceTempView, spark.sql. Enables SQL-like querying on DataFrames.
● Registering Temporary Views: createOrReplaceTempView, createGlobalTempView. Allows a DataFrame to be queried using SQL temporarily or globally.
● Query Optimization: explain(True). Visualizes the Catalyst Optimizer's optimizations in the query execution.
● Nested Structures: array, map, col. Handles complex nested structures like arrays and maps.
● Exploding and Collecting: explode, collect_list. Flattens arrays into rows or aggregates rows into arrays.

6. Optimization in DataFrames
1.1 Predicate Pushdown
Scenario: You are reading a large Parquet file of employee data and want to filter out only
employees from the IT department.

# Read Parquet file with predicate pushdown


df_filtered = spark.read.parquet("employees.parquet").filter(col("department") == "IT")

# Display execution plan


df_filtered.explain(True)
Explanation:

Spark pushes the filter (department == "IT") to the data source (Parquet file), reducing the
amount of data read into Spark.

explain(True): Shows the physical execution plan, confirming that filtering happens during the
file scan.

1.2 Avoiding Shuffles


Scenario: You want to join two DataFrames on the department column but need to minimize
shuffles.

# Create DataFrames
df1 = df.select("id", "name", "department")
df2 = df.select("department", "city")
# Repartition both DataFrames by department
df1 = df1.repartition("department")
df2 = df2.repartition("department")

# Perform a join
joined_df = df1.join(df2, "department")

# Display execution plan


joined_df.explain(True)
Explanation:

repartition("department"): Partitions both DataFrames on the department column, reducing the


number of shuffles during the join.
Optimized shuffles improve the join operation's performance.

1.3 Skewed Data Handling


Scenario: Data is skewed because the IT department has significantly more employees. Handle
the skewed join by using salting.

from pyspark.sql.functions import array, col, explode, floor, lit, rand

num_salts = 10

# Add a random salt to the skewed (large) side

df1_salted = df1.withColumn("salt", floor(rand() * num_salts).cast("int"))

# Replicate the small side once per salt value so every salted key finds a match

df2_salted = df2.withColumn("salt", explode(array([lit(i) for i in range(num_salts)])))

# Join on the original key plus the salt, then drop the helper column

salted_joined_df = df1_salted.join(df2_salted, ["department", "salt"]).drop("salt")

salted_joined_df.show()
Explanation:

Salting: Appending a pseudo-random key (salt) to the skewed side and replicating the other side across all salt values spreads the skewed key over many partitions, balancing the workload during the join.
2. Performance Tuning

2.1 Managing Spark Configurations


Scenario: You are running a job with insufficient memory. Increase the executor memory.

from pyspark.sql import SparkSession


# Configure Spark session with optimized memory
spark = SparkSession.builder \
.appName("Performance Tuning") \
.config("spark.executor.memory", "4g") \
.config("spark.executor.cores", "2") \
.config("spark.sql.shuffle.partitions", "10") \
.getOrCreate()

# Process the data


df.show()
Explanation:

spark.executor.memory: Allocates 4 GB of memory per executor.


spark.executor.cores: Limits each executor to 2 CPU cores.
spark.sql.shuffle.partitions: Reduces the number of shuffle partitions for better
performance.

2.2 Understanding Spark UI for Debugging


Scenario: Debug slow-running jobs using the Spark UI.

1. Submit your Spark job.
2. Access the Spark UI (typically at http://<driver-node>:4040).
3. Review:
● Stages: Check if shuffles are creating bottlenecks.
● Tasks: Monitor executor performance and task distribution.
● Storage: Inspect cached data for memory issues.
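
As a small sketch of making this easier, the configuration below enables Spark's standard event logging so finished jobs can also be replayed in the History Server, and prints the live UI's address; the log directory is a placeholder and must exist and be writable:

from pyspark.sql import SparkSession

# Enable event logging so completed applications can be reviewed after they finish
spark = SparkSession.builder \
    .appName("Spark UI Debugging") \
    .config("spark.eventLog.enabled", "true") \
    .config("spark.eventLog.dir", "/tmp/spark-events") \
    .getOrCreate()

# While the application runs, the live UI is served by the driver (port 4040 by default)
print(spark.sparkContext.uiWebUrl)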

2.3 Broadcast Joins and Shuffle Optimization


Scenario: Join a large df1 with a small df2 DataFrame using a broadcast join.

from pyspark.sql.functions import broadcast


# Perform a broadcast join
broadcast_join_df = df1.join(broadcast(df2), "department")

# Display execution plan


broadcast_join_df.explain(True)
Explanation:

broadcast(): Forces Spark to broadcast the smaller DataFrame (df2) to all executors, avoiding
expensive shuffles during the join.

3. Catalyst Optimizer and Tungsten


3.1 Query Planning and Execution
Scenario: Inspect Catalyst Optimizer's logical and physical plans for a query.

# Example query
query_plan_df = df.groupBy("department").agg({"salary": "avg"})

# Display execution plan


query_plan_df.explain(True)
Explanation:

Catalyst Optimizer transforms the query into logical, optimized logical, and physical plans.
explain(True): Visualizes the entire transformation process.

3.2 Code Generation for Performance


Scenario: Evaluate the Tungsten engine's effect on query execution.
# Perform an operation
tungsten_example = df.withColumn("adjusted_salary", col("salary") * 1.1)

# Display execution plan


tungsten_example.explain(True)
Explanation:

Tungsten generates optimized bytecode for better CPU efficiency, reducing runtime overhead.

4. Optimizing UDFs
Scenario: Replace a Python UDF with a native Spark SQL function.
from pyspark.sql.functions import col, upper

# Avoid a UDF by using a native Spark SQL function
df_transformed = df.withColumn("name_uppercase", upper(col("name")))

df_transformed.show()
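
When a UDF cannot be replaced by a built-in function, a vectorized pandas UDF is usually faster than a row-at-a-time UDF because it processes whole batches of rows. A minimal sketch reusing the earlier name-masking example (assumes Spark 3.x with pandas and pyarrow installed, and non-empty names):

import pandas as pd
from pyspark.sql.functions import pandas_udf, col
from pyspark.sql.types import StringType

# Vectorized UDF: receives a whole pandas Series per batch instead of one value at a time
@pandas_udf(StringType())
def mask_name_vectorized(names: pd.Series) -> pd.Series:
    return names.apply(lambda s: s[0] + "*" * (len(s) - 1))

masked_df = df.withColumn("masked_name", mask_name_vectorized(col("name")))
masked_df.select("id", "name", "masked_name").show()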

Sample CSV Data


Save the following data as employees.csv:

id,name,department,city,salary
1,John,IT,New York,80000
2,Jane,Finance,San Francisco,90000
3,Mark,HR,Chicago,70000
4,Linda,IT,Boston,85000
5,James,Finance,New York,95000
6,Susan,IT,San Francisco,75000
7,Robert,HR,Boston,72000
8,Karen,IT,New York,88000
9,Michael,Finance,Chicago,87000
10,Sarah,HR,San Francisco,68000
PySpark Code for the CSV
Here’s how to load and work with this data:

from pyspark.sql import SparkSession


from pyspark.sql.functions import col, broadcast, upper, lit, concat, expr, array, explode, floor, rand

# Initialize Spark Session


spark = SparkSession.builder \
.appName("Optimization Examples") \
.config("spark.sql.shuffle.partitions", "2") \
.getOrCreate()

# Load the CSV file


file_path = "employees.csv" # Ensure this file is in the correct directory
df = spark.read.csv(file_path, header=True, inferSchema=True)

# Show the loaded data


df.show()

# Example 1: Predicate Pushdown


df_filtered = df.filter(col("department") == "IT")
df_filtered.explain(True)
df_filtered.show()

# Example 2: Avoiding Shuffles with Repartitioning


df1 = df.select("id", "name", "department").repartition("department")
df2 = df.select("department", "city").repartition("department")
joined_df = df1.join(df2, "department")
joined_df.explain(True)
joined_df.show()

# Example 3: Skewed Data Handling with Salting

num_salts = 10
salted_df1 = df1.withColumn("salt", floor(rand() * num_salts).cast("int"))
salted_df2 = df2.withColumn("salt", explode(array([lit(i) for i in range(num_salts)])))
salted_joined_df = salted_df1.join(salted_df2, ["department", "salt"]).drop("salt")
salted_joined_df.show()

# Example 4: Broadcast Joins


small_df = df.filter(col("department") == "HR")
broadcast_join = df.join(broadcast(small_df), "department")
broadcast_join.explain(True)
broadcast_join.show()

# Example 5: Optimizing UDFs


optimized_df = df.withColumn("name_uppercase", upper(col("name")))
optimized_df.show()

# Example 6: Catalyst Optimizer Example


grouped_df = df.groupBy("department").agg({"salary": "avg"})
grouped_df.explain(True)
grouped_df.show()

Key Notes:

● CSV File: Ensure the employees.csv file is in the same directory where you're running the script, or provide the absolute file path in the file_path variable.
● Code Structure: Each example builds on the loaded DataFrame (df) from the CSV. Optimizations (predicate pushdown, repartitioning, etc.) are showcased step by step.
● Reusable Data: All examples use the same CSV data to keep consistency across operations.

Shuffling and Partitioning in PySpark

What is Shuffling in PySpark?


● Shuffling is the process of redistributing data across partitions in a Spark cluster. It is
triggered when operations require data from one partition to be moved to another, such
as joins, groupBy, or aggregations.
● Why Shuffling Happens: Shuffling occurs to align data with the operations being
performed (e.g., when grouping or joining data based on a specific key).

What is Partitioning in PySpark?


● Partitioning refers to dividing a dataset into smaller, logical chunks (partitions)
distributed across the nodes of a cluster. Each partition can be processed in parallel.
● Partitioning helps Spark manage data locality and minimize shuffles for certain
operations.

Why Are Shuffling and Partitioning Required?


1. Shuffling:
○ To group or join data based on specific keys, all rows with the same key must
end up in the same partition.
○ For example, in a groupBy operation, all rows for a given key (e.g.,
"department") must be collected into the same partition to compute the
aggregation.
2. Partitioning:
○ Partitioning allows Spark to optimize data movement during shuffle-heavy
operations.
○ By pre-partitioning data, operations like joins and aggregations can avoid
unnecessary shuffling.

When to Use Partitioning?


1. Before joins: Partitioning both DataFrames on the join key can reduce shuffles.
2. Before groupBy: Partitioning the data on the group key minimizes shuffle overhead.
3. Skewed data: When certain keys dominate, custom partitioning can help balance
workloads.

Clear Example

Scenario: Joining two DataFrames (df1 and df2) on the department column.

Without Partitioning
from pyspark.sql import SparkSession

# Initialize Spark session


spark = SparkSession.builder.appName("ShufflingExample").getOrCreate()

# Sample DataFrames
data1 = [(1, "John", "IT"), (2, "Jane", "Finance"), (3, "Mark", "HR"),
(4, "Linda", "IT")]
data2 = [("IT", "New York"), ("Finance", "San Francisco"), ("HR",
"Chicago")]

df1 = spark.createDataFrame(data1, ["id", "name", "department"])


df2 = spark.createDataFrame(data2, ["department", "city"])

# Join operation
joined_df = df1.join(df2, "department")
joined_df.explain(True)
joined_df.show()

Explanation:

● Shuffling Happens: Since df1 and df2 are not pre-partitioned on the department
column, Spark shuffles data across partitions to align department keys.

Output of explain(True):

● You will see a shuffle exchange stage, indicating data movement across partitions.

With Partitioning
# Repartition both DataFrames by the join key (department)
df1 = df1.repartition("department")
df2 = df2.repartition("department")

# Join operation
joined_df = df1.join(df2, "department")
joined_df.explain(True)
joined_df.show()

Explanation:
● No shuffle exchange at the join: by repartitioning both DataFrames on department, Spark ensures that data with the same department value is colocated in the same partition, so the join itself avoids an additional expensive shuffle (the repartition step performs the data movement up front).
