Spark All Optimizations & Code
Ganesh R
Azure Data Engineer
Let's discuss all the Spark optimizations you need to apply in your Spark
projects. These are also extremely important for your interview
preparation.
Let's go through each one, with code.
Broadcast variables cache a read-only value on every executor once,
instead of shipping a copy of it with each task.
Code Example
# Broadcast a small read-only list to all executors once
broadcastVar = sc.broadcast([1, 2, 3])
print(broadcastVar.value)  # [1, 2, 3], readable in any task
Shuffles are expensive operations that involve moving data across the
cluster. Minimizing shuffles by using map-side combine or careful
partitioning can significantly improve performance.
Code Example
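A minimal sketch with made-up data: reduceByKey combines values within
each partition before the shuffle (a map-side combine), whereas
groupByKey ships every individual record across the network.

# Map-side combine: only one partial sum per key per partition is shuffled
rdd = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("b", 4)])
sums = rdd.reduceByKey(lambda x, y: x + y)

# groupByKey would shuffle every record -- avoid it for aggregations:
# sums = rdd.groupByKey().mapValues(sum)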
Using columnar storage formats like Parquet or ORC can improve read
performance by allowing Spark to read only the necessary columns. These
formats also support efficient compression and encoding schemes.
Code Example
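A minimal sketch, assuming a DataFrame df; the path and column names are
illustrative:

# Write as Parquet (columnar, compressed)
df.write.mode("overwrite").parquet("/tmp/events_parquet")

# On read, Spark scans only the requested column chunks
spark.read.parquet("/tmp/events_parquet").select("user_id", "event_time").show()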
Predicate pushdown allows Spark to filter data at the data source level
before loading it into memory, reducing the amount of data transferred
and improving performance.
Code Example
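A minimal sketch, reusing the illustrative Parquet path above and an
assumed event_date column:

# The filter is pushed down to the Parquet scan, so row groups whose
# min/max statistics rule out the predicate are skipped entirely
df = spark.read.parquet("/tmp/events_parquet").filter("event_date >= '2024-01-01'")
df.explain()  # pushed predicates appear as PushedFilters in the plan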
Vectorized (pandas) UDFs process data in batches via Apache Arrow,
avoiding the per-row serialization overhead of regular Python UDFs.
Code Example
import pandas as pd
from pyspark.sql.functions import pandas_udf

# Scalar pandas UDF: x arrives as a pandas Series and is processed in
# batches via Apache Arrow, not row by row
@pandas_udf("double")
def vectorized_udf(x: pd.Series) -> pd.Series:
    return x + 1

df.withColumn("new_column", vectorized_udf(df["existing_column"])).show()
Broadcast joins are more efficient than shuffle joins when one of the
DataFrames is small: the small DataFrame is broadcast to every node,
avoiding a shuffle.
Code Example
# Broadcast join
from pyspark.sql.functions import broadcast
df = df1.join(broadcast(df2), df1["key"] == df2["key"])
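Spark also broadcasts automatically when it estimates a table to be
smaller than spark.sql.autoBroadcastJoinThreshold (10 MB by default);
the explicit broadcast() hint forces it regardless of the estimate.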
Speculative execution re-runs slow tasks in parallel and uses the result of
the first completed task, helping to mitigate the impact of straggler tasks.
Code Example
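A minimal sketch: spark.speculation is a scheduler-level setting, so it
is typically set when the application starts, via the builder,
spark-defaults.conf, or spark-submit --conf (the app name here is
illustrative):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("speculation-demo")
         .config("spark.speculation", "true")  # re-launch slow tasks
         .config("spark.speculation.quantile", "0.75")  # fraction of tasks that must finish before speculating
         .getOrCreate())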
Adaptive Query Execution (AQE) re-optimizes query plans at runtime using
shuffle statistics, coalescing small partitions and mitigating skewed
joins.
Code Example
# Enable Adaptive Query Execution (on by default since Spark 3.2)
spark.conf.set("spark.sql.adaptive.enabled", "true")
# Let AQE coalesce small shuffle partitions at runtime
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")