PySpark Notes
● Volume:
The size of data is staggering—petabytes, even zettabytes. Traditional databases
collapse under such weight. Imagine tracking every grain of sand on Earth; that’s the
challenge of volume.
● Variety:
Data doesn’t come neatly packaged. There’s structured data (like tables), semi-
structured data (like JSON or XML), and unstructured data (like videos or audio). A
single event, such as booking a flight, generates multiple types of data.
● Velocity:
Data moves at blinding speed, often in real-time. A streaming video platform like Netflix
gathers user data every second, which must be processed immediately to provide
recommendations or maintain quality.
● Veracity:
Not all data is reliable. Noise, inconsistencies, or outright inaccuracies can mislead
decision-makers. For example, social media data may contain spam or fake reviews.
The answer to these challenges is distributed computing. Imagine slicing a massive cake (the data) and sharing it among hundreds of bakers (machines): together, they process the cake faster than one baker could. Frameworks like Hadoop’s HDFS (Hadoop Distributed File System) lay the storage foundation for this approach.
● In-Memory Computing: Unlike older tools that write intermediate data to disk, Spark
keeps data in memory. This boosts speed dramatically, particularly for iterative tasks like
machine learning.
● Unified Platform: Spark handles batch processing (like Hadoop), real-time streaming,
machine learning, and graph analytics—all in one tool.
● Polyglot: Spark speaks multiple programming languages: Python (PySpark), Java,
Scala, and R, making it accessible to diverse developers.
Today, Spark powers everything from ride-sharing platforms (Uber) to streaming services
(Netflix).
Use Cases:
Still, Spark can run on Hadoop’s distributed storage (HDFS), making them allies when needed.
Another important point is that input and output data are stored in various formats; Spark has connectors to read and write them. Doing so, however, means serializing and deserializing the data. While transparent, serialization is often the most expensive operation.
Finally, Spark tries to keep data in memory for processing, but it will [ser/deser]ialize data on each worker locally when it doesn't fit in memory. Once again, this is done transparently, but it can be costly.
RDD
It's the first API provided by Spark. To put it simply, it is an unordered sequence of Scala/Java objects distributed over a cluster. All operations executed on it are JVM methods (passed to map, flatMap, groupBy, ...) that need to be serialized, sent to all workers, and applied to the JVM objects there. This is pretty much the same as using a Scala Seq, but distributed. It is strongly typed, meaning that "if it compiles then it works" (if you don't cheat). However, lots of distribution issues can arise, especially if Spark doesn't know how to [de]serialize the JVM classes and methods.
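For reference, the same idea is exposed in PySpark through the SparkContext. A minimal sketch with made-up numbers, assuming an existing SparkSession named spark (as in the examples later in these notes):
# Distribute a local list, transform it, and collect the result
rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])
squared = rdd.map(lambda x: x * x)              # the function is serialized and shipped to the workers
even_squares = squared.filter(lambda x: x % 2 == 0)
print(even_squares.collect())                   # [4, 16]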
Dataframe
It came later and is semantically very different from the RDD. The data are treated as tables, and SQL-like operations can be applied to them. It is not typed at all, so errors can arise at any time during execution. However, there are, I think, two pros: (1) many people are used to the table/SQL semantics and operations, and (2) Spark doesn't need to deserialize a whole row to process one of its columns, provided the data format offers suitable column access. Many formats do, such as Parquet, the most commonly used file format.
Dataset
It is an improvement over DataFrame that brings some type safety. A Dataset is a DataFrame to which we associate an "encoder" tied to a JVM class, so Spark can check that the data schema is correct before executing the code. Note, however, that although you can sometimes read that Datasets are strongly typed, they are not: they bring some type safety in that you cannot compile code that uses a Dataset with a type other than the one declared. But it is very easy to write code that compiles and still fails at runtime. This is because many Dataset operations lose the type (pretty much everything apart from filter). Still, it is a huge improvement, because even when we make a mistake it fails fast: the failure happens when Spark interprets the DAG (i.e. at start) instead of during data processing.
● Dataset:
    ● pros: has optimized operations over column-oriented storage
    ● pros: many operations don't need deserialization
    ● pros: provides table/SQL semantics if you like them (I don't ;)
    ● pros: Dataset operations come with an optimization engine, "Catalyst", that improves the performance of your code. I'm not sure, however, that it is really that great: if you know what you code, i.e. what is done to the data, your code should be optimized by itself.
    ● cons: most operations lose typing
    ● cons: Dataset operations can become too complicated for complex algorithms that don't suit them. The two main limits I know of are managing invalid data and complex math algorithms.
● Dataframe:
    ● pros: needed in between Dataset operations that lose the type
    ● cons: just use Dataset, it has all the advantages and more
● RDD:
    ● pros: (really) strongly typed
    ● pros: Scala/Java semantics. You can design your code pretty much the way you would for a single-JVM app that processes in-memory collections. Well, with functional semantics :)
    ● cons: full JVM deserialization is required to process the data, at every step mentioned before: after reading input, and between all processing steps that require data to be moved between workers or stored locally to manage memory bounds.
# import all
from pyspark.sql.types import *
from pyspark.sql.functions import *
Because some imported functions might override Python built-in functions, some users choose
to import these modules using an alias. The following examples show a common alias used in
Apache Spark code examples:
import pyspark.sql.types as T
import pyspark.sql.functions as F
Create a DataFrame
There are several ways to create a DataFrame. Usually you define a DataFrame against a data
source such as a table or collection of files. Then as described in the Apache Spark
fundamental concepts section, use an action, such as display, to trigger the transformations to
execute. The display method outputs DataFrames.
df_children = spark.createDataFrame(
data = [("Mikhail", 15), ("Zaky", 13), ("Zoya", 8)],
schema = ['name', 'age'])
display(df_children)
Notice in the output that the data types of the columns of df_children are automatically inferred. You can alternatively specify the types by providing a schema. Schemas are defined using StructType, which is made up of StructFields that specify each column's name, data type, and a boolean flag indicating whether the column can contain null values. You must import data types from pyspark.sql.types.
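A minimal sketch of the same DataFrame created with an explicit schema:
schema_children = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)
])
df_children_typed = spark.createDataFrame(
    data=[("Mikhail", 15), ("Zaky", 13), ("Zoya", 8)],
    schema=schema_children)
display(df_children_typed)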
filtered_df.show()
Explanation:
● filter() filters rows based on the condition (age > 30 and department ==
"Engineering").
● select() chooses specific columns to display.
● orderBy() sorts the results by salary in descending order.
● col() references a column to keep the code cleaner and avoid hardcoding
column names.
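The code that builds filtered_df appears to be missing from these notes. A minimal sketch, assuming a small made-up employees DataFrame with the columns the explanation above mentions (the later sketches reuse this DataFrame):
from pyspark.sql.functions import col

employees = spark.createDataFrame(
    [("Alice", 34, "Engineering", 95000), ("Bob", 28, "Engineering", 80000),
     ("Cara", 41, "Marketing", 70000), ("Dan", 37, "Engineering", 105000)],
    ["name", "age", "department", "salary"])

filtered_df = (employees
    .filter((col("age") > 30) & (col("department") == "Engineering"))
    .select("name", "age", "salary")
    .orderBy(col("salary").desc()))
filtered_df.show()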
aggregated_df.show()
Explanation:
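The aggregation code and its explanation also appear to be missing. A plausible minimal sketch, reusing the made-up employees DataFrame from the previous sketch and averaging salaries per department:
from pyspark.sql.functions import avg, count

aggregated_df = (employees
    .groupBy("department")
    .agg(avg("salary").alias("avg_salary"), count("*").alias("num_employees")))
aggregated_df.show()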
3. Joining DataFrames
Scenario:
Assume you have another dataset with department managers. Join this data with the
original dataset to include manager names.
joined_df.show()
Explanation:
● fillna() replaces null (missing) values with the specified value (e.g., 0 for
"bonus").
# Assumes a retirement age of 65 (the original threshold is not shown in these notes)
def years_to_retirement(age):
    return 65 - age

retirement_udf = udf(years_to_retirement, IntegerType())
retirement_df = df.withColumn("years_to_retirement", retirement_udf(col("age")))
Explanation:
# The original if/elif conditions are not shown; this city-to-region grouping is assumed for illustration
def city_to_region(city):
    if city in ("New York", "Boston"):
        return "East"
    elif city == "San Francisco":
        return "West"
    else:
        return "Unknown"
Explanation:
7. Window Functions
Scenario:
You want to rank employees within each department by their salary.
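The ranking code is missing here. A minimal sketch using rank() over a per-department window, reusing the made-up employees DataFrame from the earlier sketches:
from pyspark.sql.window import Window
from pyspark.sql.functions import col, rank

dept_window = Window.partitionBy("department").orderBy(col("salary").desc())
ranked_df = employees.withColumn("salary_rank", rank().over(dept_window))
ranked_df.show()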
Explanation:
Here are examples of various Window Functions using the same dataset:
8. Percent Rank
Scenario: Calculate the percentile rank of each employee's salary within their
department.
from pyspark.sql.functions import percent_rank
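The code that uses this import appears to be missing. A minimal sketch over the same per-department window, again on the made-up employees DataFrame:
from pyspark.sql.window import Window
from pyspark.sql.functions import col, percent_rank

dept_window = Window.partitionBy("department").orderBy(col("salary").desc())
pct_rank_df = employees.withColumn("salary_pct_rank", percent_rank().over(dept_window))
pct_rank_df.show()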
Explanation:
# Register the DataFrame as a temporary view so it can be queried with SQL
df.createOrReplaceTempView("employees")

# Write SQL query to fetch the top 3 employees with the highest salaries
top_3_employees = spark.sql("""
    SELECT id, name, department, salary
    FROM employees
    ORDER BY salary DESC
    LIMIT 3
""")
top_3_employees.show()
Explanation:
1. Exploding Data:
○ explode(col("skills")): Breaks an array column (skills) into multiple
rows, one per element.
○ Each skill for every employee appears in its own row.
2. Collecting Data:
○ collect_list("skill"): Aggregates all skill values into an array for
each employee.
○ Results in a grouped structure with arrays.
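The code for these two steps is missing. A minimal sketch with a made-up skills column (the sample CSV further below has no skills field):
from pyspark.sql.functions import explode, collect_list, col

skills_df = spark.createDataFrame(
    [("John", ["Python", "SQL"]), ("Jane", ["Spark", "Scala", "SQL"])],
    ["name", "skills"])

# Exploding: one row per (name, skill) pair
exploded_df = skills_df.select("name", explode(col("skills")).alias("skill"))
exploded_df.show()

# Collecting: rebuild the array per employee
collected_df = exploded_df.groupBy("name").agg(collect_list("skill").alias("skills"))
collected_df.show()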
6. Optimization in DataFrames
1.1 Predicate Pushdown
Scenario: You are reading a large Parquet file of employee data and want to filter out only
employees from the IT department.
Spark pushes the filter (department == "IT") to the data source (Parquet file), reducing the
amount of data read into Spark.
explain(True): Shows the physical execution plan, confirming that filtering happens during the
file scan.
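A minimal sketch, assuming the employee data has already been written to Parquet at a hypothetical path employees.parquet:
from pyspark.sql.functions import col

# The department == "IT" predicate can be pushed down to the Parquet scan
parquet_df = spark.read.parquet("employees.parquet")
it_df = parquet_df.filter(col("department") == "IT")
it_df.explain(True)   # look for PushedFilters in the FileScan node
it_df.show()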
# Create DataFrames
df1 = df.select("id", "name", "department")
df2 = df.select("department", "city")
# Repartition both DataFrames by department
df1 = df1.repartition("department")
df2 = df2.repartition("department")
# Perform a join
joined_df = df1.join(df2, "department")
salted_joined_df.show()
Explanation:
Salting: Adding a pseudo-random key (salt) distributes the skewed data across partitions,
balancing the workload during joins.
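The salting code itself is missing. A minimal sketch on the df1/df2 DataFrames from the snippet above, using an arbitrary number of salt buckets:
from pyspark.sql.functions import array, explode, lit, rand

NUM_SALTS = 4  # arbitrary number of salt buckets

# Add a pseudo-random salt to the large, skewed side
df1_salted = df1.withColumn("salt", (rand() * NUM_SALTS).cast("int"))

# Replicate each row of the smaller side once per salt value
df2_salted = df2.withColumn("salt", explode(array(*[lit(i) for i in range(NUM_SALTS)])))

# Join on the original key plus the salt
salted_joined_df = df1_salted.join(df2_salted, ["department", "salt"])
salted_joined_df.show()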
2. Performance Tuning
broadcast(): Forces Spark to broadcast the smaller DataFrame (df2) to all executors, avoiding
expensive shuffles during the join.
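A minimal sketch of the broadcast join on the same df1/df2:
from pyspark.sql.functions import broadcast

# Broadcast the smaller DataFrame so the join avoids shuffling df1
broadcast_joined_df = df1.join(broadcast(df2), "department")
broadcast_joined_df.explain(True)   # shows a BroadcastHashJoin in the physical plan
broadcast_joined_df.show()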
# Example query
query_plan_df = df.groupBy("department").agg({"salary": "avg"})
query_plan_df.explain(True)
Catalyst Optimizer transforms the query into logical, optimized logical, and physical plans.
explain(True): Visualizes the entire transformation process.
Tungsten generates optimized bytecode for better CPU efficiency, reducing runtime overhead.
4. Optimizing UDFs
Scenario: Replace a Python UDF with a native Spark SQL function.
from pyspark.sql.functions import col, upper
# Built-in upper() instead of a row-by-row Python UDF (the output column name is assumed)
df_transformed = df.withColumn("name_upper", upper(col("name")))
df_transformed.show()
id,name,department,city,salary
1,John,IT,New York,80000
2,Jane,Finance,San Francisco,90000
3,Mark,HR,Chicago,70000
4,Linda,IT,Boston,85000
5,James,Finance,New York,95000
6,Susan,IT,San Francisco,75000
7,Robert,HR,Boston,72000
8,Karen,IT,New York,88000
9,Michael,Finance,Chicago,87000
10,Sarah,HR,San Francisco,68000
PySpark Code for the CSV
Here’s how to load and work with this data:
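The loading code itself appears to be missing. A minimal sketch, assuming the data above is saved as employees.csv:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

file_path = "employees.csv"   # or an absolute path
df = spark.read.csv(file_path, header=True, inferSchema=True)
df.printSchema()
df.show()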
Key Notes:
● CSV File: Ensure the employees.csv file is in the same directory where you're running the script, or provide the absolute file path in the file_path variable.
● Code Structure: Each example builds on the loaded DataFrame (df) from the CSV. Optimizations (predicate pushdown, repartitioning, etc.) are showcased step by step.
● Reusable Data: All examples use the same CSV data to keep consistency across operations.
Clear Example
Scenario: Joining two DataFrames (df1 and df2) on the department column.
Without Partitioning
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data
data1 = [(1, "John", "IT"), (2, "Jane", "Finance"), (3, "Mark", "HR"), (4, "Linda", "IT")]
data2 = [("IT", "New York"), ("Finance", "San Francisco"), ("HR", "Chicago")]

# Build the two DataFrames
df1 = spark.createDataFrame(data1, ["id", "name", "department"])
df2 = spark.createDataFrame(data2, ["department", "city"])

# Join operation
joined_df = df1.join(df2, "department")
joined_df.explain(True)
joined_df.show()
Explanation:
● Shuffling Happens: Since df1 and df2 are not pre-partitioned on the department
column, Spark shuffles data across partitions to align department keys.
Output of explain(True):
● You will see a shuffle exchange stage, indicating data movement across partitions.
With Partitioning
# Repartition both DataFrames by the join key (department)
df1 = df1.repartition("department")
df2 = df2.repartition("department")
# Join operation
joined_df = df1.join(df2, "department")
joined_df.explain(True)
joined_df.show()
Explanation:
● No Extra Shuffle at Join Time: By repartitioning both DataFrames on department, Spark ensures that rows with the same department value are colocated in the same partition, so the join itself does not need another shuffle exchange. The shuffle cost is instead paid once, up front, by the repartition() calls, which pays off when the repartitioned DataFrames are reused.