Spark 3.0 New Features
Spark 3.0 is coming with many important features, but here I list only a few notable ones.
Spark with GPU Support
● Spark's cluster managers are made GPU-aware in Spark 3.0
● No regression in scheduler performance for normal jobs
Read details here.
Spark Graph with Cypher Support
This is a complete graph package, covering both the Property Graph model and the Cypher query language.
The project is inherited from Morpheus and will fully support DataFrames.
Graph queries now get their own Catalyst-style optimizer and follow principles similar to Spark SQL.
Read details here.
Binary Files as a Data Source
Spark 3.0 is bringing binary files as a core data source.
It is useful to have a data source implementation for binary files, which can be used to build features that load images, audio, and video.
val df = spark.read.format("binaryFile").load(path)  // path: location of the binary files
The binary file format does not support write operations, and it currently supports only files smaller than 2 GB.
Dynamic Partition Pruning
At run time, the filter applied to the dimension side of a join is reused to skip non-matching partitions of the fact table. See the Databricks presentation on this.
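To make the idea concrete, here is a toy plain-Scala sketch (not Spark APIs; all names are made up for illustration) of how a runtime filter on the dimension side decides which fact-table partitions get scanned:

```scala
// Fact data stored per partition key (e.g. a table partitioned by date).
val factPartitions: Map[String, Seq[Int]] = Map(
  "2019-01-01" -> Seq(1, 2),
  "2019-01-02" -> Seq(3),
  "2019-01-03" -> Seq(4, 5))

// Runtime result of the dimension-side filter: only these dates survive the join.
val dimDates = Set("2019-01-02", "2019-01-03")

// A static plan would scan all partitions; dynamic pruning scans only matching ones.
val scanned = factPartitions.filter { case (date, _) => dimDates(date) }
val rows = scanned.values.flatten.toSeq.sorted
```

In real Spark the pruning filter is injected into the fact-table scan by the optimizer at run time; this sketch only shows why skipping partitions saves I/O.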
Tree Based Feature Transformation
Spark ML is now equipped with tree-based feature transformation: the index of the leaf each sample lands in can be used as a derived feature.
This is inspired by the following:
[Link]
XGBoosting
LightGBM
Catboost
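The idea these libraries popularized (GBDT+LR) is to map each sample to the leaf it falls into per tree, then one-hot encode the leaf indices. A toy plain-Scala sketch, where `Stump` is a made-up two-leaf "tree" for illustration:

```scala
// A depth-1 decision tree: two leaves, split on one feature.
case class Stump(featureIdx: Int, threshold: Double) {
  // Leaf 0 = left (<= threshold), leaf 1 = right.
  def leaf(x: Seq[Double]): Int = if (x(featureIdx) <= threshold) 0 else 1
}

val trees = Seq(Stump(0, 1.5), Stump(1, 0.0))

// One-hot encode each tree's leaf index, concatenated across trees.
def transform(x: Seq[Double]): Seq[Int] =
  trees.flatMap { t =>
    val l = t.leaf(x)
    Seq(if (l == 0) 1 else 0, if (l == 1) 1 else 0)
  }
```

With real tree ensembles the leaves number in the hundreds, and the resulting sparse binary vector is typically fed to a linear model.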
Kafka Header Support in Structured Streaming
Kafka headers (available since Kafka 0.11.0) are now supported in Structured Streaming.
val df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "topic1")
  .option("includeHeaders", "true")
  .load()
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "headers")
  .as[(String, String, Array[(String, Array[Byte])])]
Spark with JDK 11
Spark now fully supports JDK 11, considering the end of life of JDK 8 and the fact that JDK 9/10 are no longer actively maintained.
Note that JDK 11 performance can be slower than JDK 8.
[Link]
[Link]
Kubernetes Related Features
The following are a few notable features (ones I found interesting) implemented in Spark 3.0:
● User-specified pod templating
● Bug fix for the supported OpenJDK Docker image
● The latest version of Kubernetes is now supported
● [Link] can now actually be passed to the executor, and python3 is now the default
● Improvements to dynamic allocation with Kubernetes
● Fix for a Kubernetes client bug where a daemon thread blocked JVM exit
● Configurable resource requests for the driver pod
● Kubernetes support for GPU-aware scheduling
● RegisteredExecutors can be reloaded after an ExternalShuffleService restart
● Configurable auth secret source in the K8s backend
● Support for an automatic [Link] secret in the Kubernetes backend
● Subpath mounting in Kubernetes
● Kerberos support in the Kubernetes resource manager
● Support for emptyDir volumes/tmpfs
● More mature spark-submit with K8s
Cache Data can be Analyzed
This is a feature I always wanted: with Spark 3.0, it is possible to analyze a cached table.
Spark can analyze cached data and hold temporary column statistics for InMemoryRelation.
sql(
  """CACHE TABLE cachedQuery AS
    | SELECT c0, avg(c1) AS v1, avg(c2) AS v2
    | FROM (SELECT id % 3 AS c0, id % 5 AS c1, 2 AS c2 FROM range(1, 30))
    | GROUP BY c0
  """.stripMargin)
// Analyzes one column in the cached logical plan
sql("ANALYZE TABLE cachedQuery COMPUTE STATISTICS FOR COLUMNS v1")
Spark 3.0 Now Has a More Readable Explain
The explain output for physical plans used to be very hard to read, but the new format is much more readable.
Old Format:
*(2) Project [key#2, max(val)#15]
+- *(2) Filter (isnotnull(max(val#3)#18) AND (max(val#3)#18 > 0))
+- *(2) HashAggregate(keys=[key#2], functions=[max(val#3)], output=[key#2, max(val)#15,
max(val#3)#18])
+- Exchange hashpartitioning(key#2, 200)
+- *(1) HashAggregate(keys=[key#2], functions=[partial_max(val#3)], output=[key#2, max#21])
+- *(1) Project [key#2, val#3]
+- *(1) Filter (isnotnull(key#2) AND (key#2 > 0))
+- *(1) FileScan parquet default.explain_temp1[key#2,val#3] Batched: true, DataFilters:
[isnotnull(key#2), (key#2 > 0)], Format: Parquet, Location:
InMemoryFileIndex[file:/user/hive/warehouse/explain_temp1], PartitionFilters: [], PushedFilters:
[IsNotNull(key), GreaterThan(key,0)], ReadSchema: struct<key:int,val:int>
New Format:
Project (8)
+- Filter (7)
+- HashAggregate (6)
+- Exchange (5)
+- HashAggregate (4)
+- Project (3)
+- Filter (2)
+- Scan parquet default.explain_temp1 (1)
(1) Scan parquet default.explain_temp1 [codegen id : 1]
Output: [key#2, val#3]
(2) Filter [codegen id : 1]
Input : [key#2, val#3]
Condition : (isnotnull(key#2) AND (key#2 > 0))
(3) Project [codegen id : 1]
Output : [key#2, val#3]
Input : [key#2, val#3]
(4) HashAggregate [codegen id : 1]
Input: [key#2, val#3]
(5) Exchange
Input: [key#2, max#11]
(6) HashAggregate [codegen id : 2]
Input: [key#2, max#11]
(7) Filter [codegen id : 2]
Input : [key#2, max(val)#5, max(val#3)#8]
Condition : (isnotnull(max(val#3)#8) AND (max(val#3)#8 > 0))
(8) Project [codegen id : 2]
Output : [key#2, max(val)#5]
Input : [key#2, max(val)#5, max(val#3)#8]
Dynamic Allocation without Any External Shuffle Service
Dynamic allocation is now possible without an external shuffle service, which brings dynamic allocation to K8s (via shuffle tracking, enabled with spark.dynamicAllocation.shuffleTracking.enabled).
Note that even with Spark 3.0, dynamic allocation is not yet fully mature on Kubernetes.
RobustScaler for Spark
RobustScaler is a widely used scaler that replaces the mean/std of StandardScaler with the median/IQR, scaling features using statistics that are robust to outliers.
Inspired by scikit-learn's RobustScaler: [Link]
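A minimal plain-Scala sketch of what RobustScaler computes per feature, (x − median) / IQR (the percentile helper here uses simple linear interpolation; Spark's own implementation may differ in quantile method and options like centering):

```scala
// Linear-interpolation percentile over a pre-sorted sequence.
def percentile(sorted: Seq[Double], p: Double): Double = {
  val idx  = p * (sorted.length - 1)
  val lo   = idx.toInt
  val hi   = math.min(lo + 1, sorted.length - 1)
  val frac = idx - lo
  sorted(lo) * (1 - frac) + sorted(hi) * frac
}

// Scale a single feature column: subtract the median, divide by the IQR.
def robustScale(xs: Seq[Double]): Seq[Double] = {
  val sorted = xs.sorted
  val median = percentile(sorted, 0.5)
  val iqr    = percentile(sorted, 0.75) - percentile(sorted, 0.25)
  xs.map(x => (x - median) / iqr)
}
```

Note how the outlier 100 below barely perturbs the scaling of the other values, which is the point of using median/IQR instead of mean/std.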
Logistic Loss is now supported in Spark
Log loss, a widely known metric for classification tasks, is now available.
val df = sc.parallelize(labels.zip(probabilities)).map {
  case (label, probability) =>
    val prediction = probability.argmax
    (prediction, label, probability)
}.toDF("prediction", "label", "probability")
val evaluator = new MulticlassClassificationEvaluator()
  .setMetricName("logLoss")
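For reference, the metric itself is just the negative mean log-probability assigned to the true class; a plain-Scala sketch (clipping probabilities is an assumption to avoid log(0), as most implementations do):

```scala
// Multiclass log loss: -mean over samples of log(p[true class]).
// probs(i) is the predicted probability vector for sample i.
def logLoss(labels: Seq[Int], probs: Seq[Seq[Double]], eps: Double = 1e-15): Double = {
  val perSample = labels.zip(probs).map { case (y, p) =>
    val clipped = math.max(eps, math.min(1 - eps, p(y))) // guard against log(0)
    -math.log(clipped)
  }
  perSample.sum / perSample.length
}
```

A uniform 50/50 prediction on a binary problem gives log(2) ≈ 0.693, and a near-perfect prediction gives a loss near zero.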
Spark's New Compression Codec Handling
Instead of using the ZStd codec directly, Spark's CompressionCodec wraps the ZStd codec in a buffered stream to avoid the overhead of excessive JNI calls when compressing/decompressing small amounts of data.
This leads to faster performance.
New Higher Order Functions
● map_entries
● map_filter
● map_zip_with
● transform_keys
● transform_values
● filter(array<T>, function<T, Int, boolean>) → array<T>
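For intuition, here are plain-Scala analogues of what these SQL functions compute (illustration only; in Spark SQL they are written with lambdas, e.g. map_filter(m, (k, v) -> v > 1)):

```scala
val m = Map("a" -> 1, "b" -> 2, "c" -> 3)

val entries  = m.toSeq                                     // map_entries
val filtered = m.filter { case (_, v) => v > 1 }           // map_filter
val upper    = m.map { case (k, v) => (k.toUpperCase, v) } // transform_keys
val doubled  = m.map { case (k, v) => (k, v * 2) }         // transform_values
// map_zip_with merges two maps key-wise with a lambda (not sketched here).

// filter(array<T>, function<T, Int, boolean>): the lambda also sees the
// element's index -- here, keep elements at even positions.
val arr = Seq(10, 11, 12, 13)
val evenIdx = arr.zipWithIndex.collect { case (x, i) if i % 2 == 0 => x }
```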
Executor Memory Metrics are now Available
Executor metrics provide insight into how executor and driver JVM memory is used across the different memory regions. They can help determine good values for [Link], [Link], [Link], and [Link].
Currently this is available with Prometheus.
$ bin/spark-shell --master spark://`hostname`:7077 --conf
[Link]=true
Check all available resources here.
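For a local sense of the same JVM memory regions, the standard management beans expose heap and off-heap usage for any JVM (plain Scala; this is separate from Spark's own metrics endpoint):

```scala
import java.lang.management.ManagementFactory

// Heap vs. non-heap usage of the current JVM -- the same kinds of regions
// the executor memory metrics report per executor and driver.
val memBean = ManagementFactory.getMemoryMXBean
val heap    = memBean.getHeapMemoryUsage
val nonHeap = memBean.getNonHeapMemoryUsage

println(s"heap: used=${heap.getUsed} committed=${heap.getCommitted} max=${heap.getMax}")
println(s"non-heap: used=${nonHeap.getUsed} committed=${nonHeap.getCommitted}")
```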
There is still a lot more to cover; I will write about it separately.
CSV/JSON migrated to Datasource V2
Reference: Databricks
As a general computing engine, Spark can process data from various data
management/storage systems, including HDFS, Hive, Cassandra and Kafka. For flexibility
and high throughput, Spark defines the Data Source API, which is an abstraction of the
storage layer. The Data Source API has two requirements.
1) Generality: support reading/writing most data management/storage systems.
2) Flexibility: customize and optimize the read and write paths for different systems
based on their capabilities.
Check further details here.