PySpark Notes
● Volume:
The size of data is staggering—petabytes, even zettabytes. Traditional databases
collapse under such weight. Imagine tracking every grain of sand on Earth; that’s the
challenge of volume.
● Variety:
Data doesn’t come neatly packaged. There’s structured data (like tables), semi-
structured data (like JSON or XML), and unstructured data (like videos or audio). A
single event, such as booking a flight, generates multiple types of data.
● Velocity:
Data moves at blinding speed, often in real-time. A streaming video platform like Netflix
gathers user data every second, which must be processed immediately to provide
recommendations or maintain quality.
● Veracity:
Not all data is reliable. Noise, inconsistencies, or outright inaccuracies can mislead
decision-makers. For example, social media data may contain spam or fake reviews.
The answer to these challenges is distributed computing. Imagine slicing a massive cake (the data) and sharing it among hundreds of bakers (machines): together, they process the cake faster than one baker could. Frameworks like Hadoop’s HDFS (Hadoop Distributed File System) lay the storage foundation for this approach.
● In-Memory Computing: Unlike older tools that write intermediate data to disk, Spark
keeps data in memory. This boosts speed dramatically, particularly for iterative tasks like
machine learning.
● Unified Platform: Spark handles batch processing (like Hadoop), real-time streaming,
machine learning, and graph analytics—all in one tool.
● Polyglot: Spark speaks multiple programming languages: Python (PySpark), Java,
Scala, and R, making it accessible to diverse developers.
Today, Spark powers everything from ride-sharing platforms (Uber) to streaming services
(Netflix).
Use Cases:
Still, Spark can run on Hadoop’s distributed storage (HDFS), making them allies when needed.
Another important point is that input and output data are stored in various formats; Spark has connectors to read and write them. Doing so, however, means serializing and deserializing the data. While transparent, serialization is often the most expensive operation.
Finally, Spark tries to keep data in memory for processing, but it will [ser/deser]ialize data on each worker locally when it doesn't fit in memory. Once again, this is done transparently, but it can be costly.
RDD
It's the first API provided by Spark. To put it simply, it is an unordered sequence of Scala/Java objects distributed over a cluster. All operations executed on it are JVM methods (passed to map, flatMap, groupBy, ...) that need to be serialized, sent to all workers, and applied to the JVM objects there. This is pretty much the same as using a Scala Seq, but distributed. It is strongly typed, meaning that "if it compiles then it works" (if you don't cheat). However, lots of distribution issues can arise, especially if Spark doesn't know how to [de]serialize the JVM classes and methods.
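For reference, the same idea is exposed in PySpark through the SparkContext. A minimal sketch with made-up numbers, assuming an existing SparkSession named spark (as in the examples later in these notes):
# Distribute a local list, transform it, and collect the result
rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])
squared = rdd.map(lambda x: x * x)              # the function is serialized and shipped to the workers
even_squares = squared.filter(lambda x: x % 2 == 0)
print(even_squares.collect())                   # [4, 16]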
Dataframe
It came later and is semantically very different from the RDD. The data are treated as tables, and SQL-like operations can be applied to them. It is not typed at all, so errors can arise at any time during execution. However, there are, I think, two pros: (1) many people are used to the table/SQL semantics and operations, and (2) Spark doesn't need to deserialize a whole row to process one of its columns, provided the data format offers suitable column access. Many formats do, such as Parquet, the most commonly used file format.
Dataset
It is an improvement over DataFrame that brings some type safety. A Dataset is a DataFrame to which we associate an "encoder" tied to a JVM class, so Spark can check that the data schema is correct before executing the code. Note, however, that although you can sometimes read that Datasets are strongly typed, they are not: they bring some type safety in that you cannot compile code that uses a Dataset with a type other than the one declared. But it is very easy to write code that compiles and still fails at runtime. This is because many Dataset operations lose the type (pretty much everything apart from filter). Still, it is a huge improvement, because even when we make a mistake it fails fast: the failure happens when Spark interprets the DAG (i.e. at start) instead of during data processing.
● Dataset:
    ● pros: has optimized operations over column-oriented storage
    ● pros: many operations don't need deserialization
    ● pros: provides table/SQL semantics if you like them (I don't ;)
    ● pros: Dataset operations come with an optimization engine, "Catalyst", that improves the performance of your code. I'm not sure, however, that it is really that great: if you know what you code, i.e. what is done to the data, your code should be optimized by itself.
    ● cons: most operations lose typing
    ● cons: Dataset operations can become too complicated for complex algorithms that don't suit them. The two main limits I know of are managing invalid data and complex math algorithms.
● Dataframe:
    ● pros: needed in between Dataset operations that lose the type
    ● cons: just use Dataset, it has all the advantages and more
● RDD:
    ● pros: (really) strongly typed
    ● pros: Scala/Java semantics. You can design your code pretty much the way you would for a single-JVM app that processes in-memory collections. Well, with functional semantics :)
    ● cons: full JVM deserialization is required to process the data, at every step mentioned before: after reading input, and between all processing steps that require data to be moved between workers or stored locally to manage memory bounds.
# import all
from pyspark.sql.types import *
from pyspark.sql.functions import *
Because some imported functions might override Python built-in functions, some users choose
to import these modules using an alias. The following examples show a common alias used in
Apache Spark code examples:
import pyspark.sql.types as T
import pyspark.sql.functions as F
Create a DataFrame
There are several ways to create a DataFrame. Usually you define a DataFrame against a data
source such as a table or collection of files. Then as described in the Apache Spark
fundamental concepts section, use an action, such as display, to trigger the transformations to
execute. The display method outputs DataFrames.
df_children = spark.createDataFrame(
data = [("Mikhail", 15), ("Zaky", 13), ("Zoya", 8)],
schema = ['name', 'age'])
display(df_children)
Notice in the output that the data types of the columns of df_children are automatically inferred. You can alternatively specify the types by providing a schema. Schemas are defined using StructType, which is made up of StructFields that specify each column's name, data type, and a boolean flag indicating whether the column can contain null values. You must import data types from pyspark.sql.types.
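A minimal sketch of the same DataFrame created with an explicit schema:
schema_children = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)
])
df_children_typed = spark.createDataFrame(
    data=[("Mikhail", 15), ("Zaky", 13), ("Zoya", 8)],
    schema=schema_children)
display(df_children_typed)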
filtered_df.show()
Explanation:
● filter() filters rows based on the condition (age > 30 and department ==
"Engineering").
● select() chooses specific columns to display.
● orderBy() sorts the results by salary in descending order.
● col() references a column to keep the code cleaner and avoid hardcoding
column names.
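The code that builds filtered_df appears to be missing from these notes. A minimal sketch, assuming a small made-up employees DataFrame with the columns the explanation above mentions (the later sketches reuse this DataFrame):
from pyspark.sql.functions import col

employees = spark.createDataFrame(
    [("Alice", 34, "Engineering", 95000), ("Bob", 28, "Engineering", 80000),
     ("Cara", 41, "Marketing", 70000), ("Dan", 37, "Engineering", 105000)],
    ["name", "age", "department", "salary"])

filtered_df = (employees
    .filter((col("age") > 30) & (col("department") == "Engineering"))
    .select("name", "age", "salary")
    .orderBy(col("salary").desc()))
filtered_df.show()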
aggregated_df.show()
Explanation:
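The aggregation code and its explanation also appear to be missing. A plausible minimal sketch, reusing the made-up employees DataFrame from the previous sketch and averaging salaries per department:
from pyspark.sql.functions import avg, count

aggregated_df = (employees
    .groupBy("department")
    .agg(avg("salary").alias("avg_salary"), count("*").alias("num_employees")))
aggregated_df.show()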
3. Joining DataFrames
Scenario:
Assume you have another dataset with department managers. Join this data with the
original dataset to include manager names.
joined_df.show()
Explanation:
● fillna() replaces null (missing) values with the specified value (e.g., 0 for
"bonus").
# Assumes a retirement age of 65 (the original threshold is not shown in these notes)
def years_to_retirement(age):
    return 65 - age

retirement_udf = udf(years_to_retirement, IntegerType())
retirement_df = df.withColumn("years_to_retirement", retirement_udf(col("age")))
Explanation:
# The original if/elif conditions are not shown; this city-to-region grouping is assumed for illustration
def city_to_region(city):
    if city in ("New York", "Boston"):
        return "East"
    elif city == "San Francisco":
        return "West"
    else:
        return "Unknown"
Explanation:
7. Window Functions
Scenario:
You want to rank employees within each department by their salary.
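The ranking code is missing here. A minimal sketch using rank() over a per-department window, reusing the made-up employees DataFrame from the earlier sketches:
from pyspark.sql.window import Window
from pyspark.sql.functions import col, rank

dept_window = Window.partitionBy("department").orderBy(col("salary").desc())
ranked_df = employees.withColumn("salary_rank", rank().over(dept_window))
ranked_df.show()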
Explanation:
Here are examples of various Window Functions using the same dataset:
8. Percent Rank
Scenario: Calculate the percentile rank of each employee's salary within their
department.
from pyspark.sql.functions import percent_rank
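The code that uses this import appears to be missing. A minimal sketch over the same per-department window, again on the made-up employees DataFrame:
from pyspark.sql.window import Window
from pyspark.sql.functions import col, percent_rank

dept_window = Window.partitionBy("department").orderBy(col("salary").desc())
pct_rank_df = employees.withColumn("salary_pct_rank", percent_rank().over(dept_window))
pct_rank_df.show()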
Explanation:
# Register the DataFrame as a temporary view so it can be queried with SQL
df.createOrReplaceTempView("employees")

# Write SQL query to fetch the top 3 employees with the highest salaries
top_3_employees = spark.sql("""
    SELECT id, name, department, salary
    FROM employees
    ORDER BY salary DESC
    LIMIT 3
""")
top_3_employees.show()
Explanation:
1. Exploding Data:
○ explode(col("skills")): Breaks an array column (skills) into multiple
rows, one per element.
○ Each skill for every employee appears in its own row.
2. Collecting Data:
○ collect_list("skill"): Aggregates all skill values into an array for
each employee.
○ Results in a grouped structure with arrays.
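The code for these two steps is missing. A minimal sketch with a made-up skills column (the sample CSV further below has no skills field):
from pyspark.sql.functions import explode, collect_list, col

skills_df = spark.createDataFrame(
    [("John", ["Python", "SQL"]), ("Jane", ["Spark", "Scala", "SQL"])],
    ["name", "skills"])

# Exploding: one row per (name, skill) pair
exploded_df = skills_df.select("name", explode(col("skills")).alias("skill"))
exploded_df.show()

# Collecting: rebuild the array per employee
collected_df = exploded_df.groupBy("name").agg(collect_list("skill").alias("skills"))
collected_df.show()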
6. Optimization in DataFrames
1.1 Predicate Pushdown
Scenario: You are reading a large Parquet file of employee data and want to filter out only
employees from the IT department.
Spark pushes the filter (department == "IT") to the data source (Parquet file), reducing the
amount of data read into Spark.
explain(True): Shows the physical execution plan, confirming that filtering happens during the
file scan.
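A minimal sketch, assuming the employee data has already been written to Parquet at a hypothetical path employees.parquet:
from pyspark.sql.functions import col

# The department == "IT" predicate can be pushed down to the Parquet scan
parquet_df = spark.read.parquet("employees.parquet")
it_df = parquet_df.filter(col("department") == "IT")
it_df.explain(True)   # look for PushedFilters in the FileScan node
it_df.show()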
# Create DataFrames
df1 = df.select("id", "name", "department")
df2 = df.select("department", "city")
# Repartition both DataFrames by department
df1 = df1.repartition("department")
df2 = df2.repartition("department")
# Perform a join
joined_df = df1.join(df2, "department")
salted_joined_df.show()
Explanation:
Salting: Adding a pseudo-random key (salt) distributes the skewed data across partitions,
balancing the workload during joins.
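The salting code itself is missing. A minimal sketch on the df1/df2 DataFrames from the snippet above, using an arbitrary number of salt buckets:
from pyspark.sql.functions import array, explode, lit, rand

NUM_SALTS = 4  # arbitrary number of salt buckets

# Add a pseudo-random salt to the large, skewed side
df1_salted = df1.withColumn("salt", (rand() * NUM_SALTS).cast("int"))

# Replicate each row of the smaller side once per salt value
df2_salted = df2.withColumn("salt", explode(array(*[lit(i) for i in range(NUM_SALTS)])))

# Join on the original key plus the salt
salted_joined_df = df1_salted.join(df2_salted, ["department", "salt"])
salted_joined_df.show()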
2. Performance Tuning
broadcast(): Forces Spark to broadcast the smaller DataFrame (df2) to all executors, avoiding
expensive shuffles during the join.
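A minimal sketch of the broadcast join on the same df1/df2:
from pyspark.sql.functions import broadcast

# Broadcast the smaller DataFrame so the join avoids shuffling df1
broadcast_joined_df = df1.join(broadcast(df2), "department")
broadcast_joined_df.explain(True)   # shows a BroadcastHashJoin in the physical plan
broadcast_joined_df.show()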
# Example query
query_plan_df = df.groupBy("department").agg({"salary": "avg"})
query_plan_df.explain(True)
Catalyst Optimizer transforms the query into logical, optimized logical, and physical plans.
explain(True): Visualizes the entire transformation process.
Tungsten generates optimized bytecode for better CPU efficiency, reducing runtime overhead.
4. Optimizing UDFs
Scenario: Replace a Python UDF with a native Spark SQL function.
from pyspark.sql.functions import col, upper
# Built-in upper() instead of a row-by-row Python UDF (the output column name is assumed)
df_transformed = df.withColumn("name_upper", upper(col("name")))
df_transformed.show()
id,name,department,city,salary
1,John,IT,New York,80000
2,Jane,Finance,San Francisco,90000
3,Mark,HR,Chicago,70000
4,Linda,IT,Boston,85000
5,James,Finance,New York,95000
6,Susan,IT,San Francisco,75000
7,Robert,HR,Boston,72000
8,Karen,IT,New York,88000
9,Michael,Finance,Chicago,87000
10,Sarah,HR,San Francisco,68000
PySpark Code for the CSV
Here’s how to load and work with this data:
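The loading code itself appears to be missing. A minimal sketch, assuming the data above is saved as employees.csv:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

file_path = "employees.csv"   # or an absolute path
df = spark.read.csv(file_path, header=True, inferSchema=True)
df.printSchema()
df.show()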
Key Notes:
● CSV File: Ensure the employees.csv file is in the same directory where you're running the script, or provide the absolute file path in the file_path variable.
● Code Structure: Each example builds on the loaded DataFrame (df) from the CSV. Optimizations (predicate pushdown, repartitioning, etc.) are showcased step by step.
● Reusable Data: All examples use the same CSV data to keep consistency across operations.
Clear Example
Scenario: Joining two DataFrames (df1 and df2) on the department column.
Without Partitioning
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data
data1 = [(1, "John", "IT"), (2, "Jane", "Finance"), (3, "Mark", "HR"), (4, "Linda", "IT")]
data2 = [("IT", "New York"), ("Finance", "San Francisco"), ("HR", "Chicago")]

# Build the two DataFrames
df1 = spark.createDataFrame(data1, ["id", "name", "department"])
df2 = spark.createDataFrame(data2, ["department", "city"])

# Join operation
joined_df = df1.join(df2, "department")
joined_df.explain(True)
joined_df.show()
Explanation:
● Shuffling Happens: Since df1 and df2 are not pre-partitioned on the department
column, Spark shuffles data across partitions to align department keys.
Output of explain(True):
● You will see a shuffle exchange stage, indicating data movement across partitions.
With Partitioning
# Repartition both DataFrames by the join key (department)
df1 = df1.repartition("department")
df2 = df2.repartition("department")
# Join operation
joined_df = df1.join(df2, "department")
joined_df.explain(True)
joined_df.show()
Explanation:
● No Extra Shuffle at Join Time: By repartitioning both DataFrames on department, Spark ensures that rows with the same department value are colocated in the same partition, so the join itself does not need another shuffle exchange. The shuffle cost is instead paid once, up front, by the repartition() calls, which pays off when the repartitioned DataFrames are reused.