Spark 3.0 New Features: Spark With GPU Support
Spark 3.0 is coming with several important features; here I list only a few of the notable ones.
Graph Query Support
This project is inherited from Morpheus and will fully support DataFrames. Graph queries will now have their own Catalyst-style optimizer and will follow principles similar to Spark SQL.
Binary File Data Source

val df = spark.read.format("binaryFile").load(dir.getPath)

The resulting DataFrame has path, modificationTime, length and content columns. The binary file source does not support write operations, and it currently only supports files smaller than 2 GB.
Dynamic Partition Pruning
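Dynamic partition pruning generates a runtime filter from the dimension side of a join and uses it to skip partitions of a partitioned fact table. Below is a minimal sketch; sales (assumed partitioned by sale_date) and dates are made-up tables for illustration, and the feature is controlled by spark.sql.optimizer.dynamicPartitionPruning.enabled, which defaults to true.

// `sales` is assumed to be partitioned by `sale_date`; `dates` is a small dimension table
val pruned = spark.sql(
  """SELECT s.item_id, s.amount
    |FROM sales s
    |JOIN dates d ON s.sale_date = d.sale_date
    |WHERE d.year = 2019
  """.stripMargin)

// at runtime the filter on `dates` becomes a partition filter on `sales`,
// so only the matching sale_date partitions are scanned
pruned.explain()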
http://www.herbrich.me/papers/adclicksfacebook.pdf
● XGBoost
● LightGBM
● CatBoost
Kafka Header Support in Structured Streaming
val df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.option("subscribe", "topic1")
.option("includeHeaders", "true")
.load()
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "headers")
  .as[(String, String, Array[(String, Array[Byte])])]
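Since the headers column is an array of key/value structs (header keys can repeat), here is a small sketch of flattening it into one row per header; the column names recordKey, headerKey and headerValue are just illustrative.

import org.apache.spark.sql.functions._

// one row per (record key, header key, header value); header values arrive as binary
val headerRows = df
  .select(col("key").cast("string").as("recordKey"), explode_outer(col("headers")).as("header"))
  .select(col("recordKey"), col("header.key").as("headerKey"),
          col("header.value").cast("string").as("headerValue"))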
Spark now fully supports JDK 11, given that JDK 8 is approaching end of life and JDK 9/10, being short-lived non-LTS releases, never saw much adoption.
The following are a few notable Kubernetes-related features (the ones I found interesting) implemented in Spark 3.0:
● Latest Version of Kubernetes is now Supported
● spark.kubernetes.pyspark.pythonVersion is now actually passed to the executors, and Python 3 is now the default
● Improvement on Dynamic Allocation with Kubernetes
● Fix for a Kubernetes client bug that caused a daemon thread to block JVM exit
● Configurable resource requests for the driver pod
● Kubernetes support for GPU-aware scheduling
● RegisteredExecutors reload supported after ExternalShuffleService restart
● Configurable auth secret source in k8s backend
● Support automatic spark.authenticate secret in Kubernetes backend
● Subpath Mounting in Kubernetes
● Kerberos Support in Kubernetes resource manager
● Supporting emptyDir Volume/tmpfs
● More mature spark-submit integration with K8s (see the configuration sketch after this list)
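To make the Kubernetes and GPU-aware scheduling items above concrete, here is a minimal configuration sketch; the API server address, container image, namespace and discovery-script path are placeholders, and in cluster mode these settings would normally be passed to spark-submit as --conf flags.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("k8s-gpu-sketch")
  .master("k8s://https://<api-server-host>:6443")                     // placeholder API server
  .config("spark.kubernetes.container.image", "myrepo/spark:3.0.0")   // placeholder image
  .config("spark.kubernetes.namespace", "spark-jobs")                 // placeholder namespace
  .config("spark.executor.instances", "2")
  // GPU-aware (accelerator-aware) scheduling
  .config("spark.executor.resource.gpu.amount", "1")
  .config("spark.task.resource.gpu.amount", "1")
  .config("spark.executor.resource.gpu.discoveryScript",
    "/opt/spark/examples/src/main/scripts/getGpusResources.sh")       // sample script shipped with Spark
  .getOrCreate()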
sql(
"""CACHE TABLE cachedQuery AS
| SELECT c0, avg(c1) AS v1, avg(c2) AS v2
| FROM (SELECT id % 3 AS c0, id % 5 AS c1, 2 AS c2 FROM range(1, 30))
| GROUP BY c0
""".stripMargin)
Spark 3.0 Now Has a More Readable Explain
The physical plan printed by EXPLAIN used to be very hard to read; the new formatted output is much more readable.
Old Format:
*(2) Project [key#2, max(val)#15]
+- *(2) Filter (isnotnull(max(val#3)#18) AND (max(val#3)#18 > 0))
   +- *(2) HashAggregate(keys=[key#2], functions=[max(val#3)], output=[key#2, max(val)#15, max(val#3)#18])
      +- Exchange hashpartitioning(key#2, 200)
         +- *(1) HashAggregate(keys=[key#2], functions=[partial_max(val#3)], output=[key#2, max#21])
            +- *(1) Project [key#2, val#3]
               +- *(1) Filter (isnotnull(key#2) AND (key#2 > 0))
                  +- *(1) FileScan parquet default.explain_temp1[key#2,val#3] Batched: true, DataFilters: [isnotnull(key#2), (key#2 > 0)], Format: Parquet, Location: InMemoryFileIndex[file:/user/hive/warehouse/explain_temp1], PartitionFilters: [], PushedFilters: [IsNotNull(key), GreaterThan(key,0)], ReadSchema: struct<key:int,val:int>
New Format:
Project (8)
+- Filter (7)
   +- HashAggregate (6)
      +- Exchange (5)
         +- HashAggregate (4)
            +- Project (3)
               +- Filter (2)
                  +- Scan parquet default.explain_temp1 (1)
(4) HashAggregate [codegen id : 1]
Input: [key#2, val#3]
(5) Exchange
Input: [key#2, max#11]
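The new output can be requested from both SQL and the Dataset API; a small sketch against the explain_temp1 table used in the plans above:

import org.apache.spark.sql.functions.max
import spark.implicits._

// SQL: prints the tree overview followed by the per-node details
spark.sql(
  """EXPLAIN FORMATTED
    |SELECT key, max(val) FROM explain_temp1
    |WHERE key > 0 GROUP BY key HAVING max(val) > 0
  """.stripMargin).collect().foreach(row => println(row.getString(0)))

// Dataset API: Spark 3.0 adds explain(mode); "formatted" produces the same style of output
spark.table("explain_temp1")
  .where($"key" > 0)
  .groupBy($"key")
  .agg(max($"val").as("maxVal"))
  .where($"maxVal" > 0)
  .explain("formatted")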
Dynamic allocation is now possible without an external shuffle service, which is what finally brings dynamic allocation to K8s.
Note that with Spark 3.0, dynamic allocation on Kubernetes is still not fully mature.
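This works through shuffle tracking, where executors holding shuffle data are kept alive until that data is no longer needed. A minimal configuration sketch (the executor bounds are placeholders):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("dynamic-allocation-sketch")
  // dynamic allocation without an external shuffle service
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
  .config("spark.dynamicAllocation.minExecutors", "1")   // placeholder bounds
  .config("spark.dynamicAllocation.maxExecutors", "10")
  .getOrCreate()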
Logistic Loss is now supported in Spark
Log loss (logistic loss), a widely used metric for classification tasks, is now available.
import spark.implicits._

val df = sc.parallelize(labels.zip(probabilities)).map {
  case (label, probability) =>
    val prediction = probability.argmax.toDouble
    (prediction, label, probability)
}.toDF("prediction", "label", "probability")
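Given a DataFrame in this shape, the metric can be computed via MulticlassClassificationEvaluator; a short sketch, assuming "logLoss" as the metric name added in 3.0 and the df built above:

import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

val evaluator = new MulticlassClassificationEvaluator()
  .setMetricName("logLoss")
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setProbabilityCol("probability")

val logLoss = evaluator.evaluate(df)
println(s"log loss = $logLoss")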
Executor Memory Metrics are now Available
Executor metrics provide insight into how executor and driver JVM memory is used across the different memory regions. They can help determine good values for spark.executor.memory, spark.driver.memory, spark.memory.fraction, and spark.memory.storageFraction.
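These peak memory metrics are also exposed per executor (as peakMemoryMetrics) through the application's REST API; a rough sketch of pulling them for a running application, where the host, port and application id are placeholders:

import scala.io.Source

// the driver UI usually listens on port 4040; the application id comes from sc.applicationId
val appId = "<app-id>"
val url = s"http://localhost:4040/api/v1/applications/$appId/executors"

// each executor entry includes a peakMemoryMetrics object
// (JVM heap/off-heap, execution, storage, direct/mapped pool memory, etc.)
println(Source.fromURL(url).mkString)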
Flexibility: customize and optimize the read and write paths for different systems, based on their capabilities.