Spark Concept
What is a cluster?
What is a container?
Application Master Container
You can write a Spark application in one of two ways.
1. PySpark Application
2. Scala Application
PySpark Application
In a PySpark application, your Python main method is the PySpark driver, and the JVM application behind it is the application driver.
The application driver distributes the work to others, so the driver does not perform any data processing work itself. Instead, it creates some executors and gets the work done by them.
After starting, the driver goes back to the YARN RM and asks for some more containers. The RM creates those containers on worker nodes and hands them over to the driver.
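For illustration, here is a minimal sketch of how you could ask for those executor containers through configuration. The application name and the numbers are assumptions for the example, not recommendations; in practice you often pass the same settings to spark-submit.

from pyspark.sql import SparkSession

# A sketch, assuming a YARN cluster is reachable: ask for 4 executor
# containers of 4 GB each. The values are illustrative only.
spark = (SparkSession.builder
         .appName("executor-request-demo")
         .master("yarn")
         .config("spark.executor.instances", "4")
         .config("spark.executor.memory", "4g")
         .getOrCreate())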
Application Driver Execution
Now the driver will start a Spark executor in each of these containers. Each container runs one Spark executor, and the Spark executor is a JVM application.
PySpark code is translated into JVM code and runs in the JVM. But if you are using some Python libraries that don't have a JVM wrapper, you will need a Python runtime environment to run them. So, the executors will create a Python runtime environment so they can execute your Python code.
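For example, a Python UDF like the sketch below runs inside that Python runtime on the executors, because the function body is plain Python. The names and data here are made up for illustration.

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
import hashlib  # plain Python code with no JVM wrapper

spark = SparkSession.builder.appName("python-udf-demo").getOrCreate()

@udf(returnType=StringType())
def md5_hash(value):
    # This body executes in the executor's Python runtime, not in the JVM.
    return hashlib.md5(value.encode("utf-8")).hexdigest()

df = spark.createDataFrame([("spark",), ("yarn",)], ["word"])
df.select(md5_hash("word").alias("md5")).show()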
Spark Submit allows you to submit the Spark application to the cluster, and you can ask it to run in one of two modes.
1. Cluster Mode
2. Client Mode
In the cluster mode, Spark Submit will reach the YARN RM,
requesting it to start the driver in an AM container. YARN will start
your driver in the AM container on a worker node in the cluster.
The cluster mode allows you to submit the application and log off from the client machine, because the driver and executors run on the cluster. Nothing stays active on your client machine, so even if you log off, the driver and the executors continue to run in the cluster.
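In practice you choose the mode with the --deploy-mode option of spark-submit. As a small sketch (the application name is assumed), a running application can check which mode it ended up in:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("deploy-mode-demo").getOrCreate()

# spark.submit.deployMode tells you whether the driver was launched on the
# cluster ("cluster") or on the client machine ("client").
print(spark.sparkContext.getConf().get("spark.submit.deployMode", "client"))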
When you run your code, Spark will create a logical plan for the job and then break that logical plan at the end of every wide dependency, creating two or more stages.
The task is the most critical concept here: it is the smallest unit of work in a Spark job. The Spark driver assigns these tasks to the executors and asks them to do the work.
Data Partition
So, the driver is responsible for assigning tasks to the executors. For each task, the executor needs two things: the code to execute and the data frame partition on which to execute it. The application driver provides both, and the executor performs the task.
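To make this concrete, here is a small sketch. The groupBy below needs a shuffle (a wide dependency), so Spark breaks the job into two stages, and each stage runs one task per partition of its input. The numbers are assumptions for illustration.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stage-demo").getOrCreate()

df = spark.range(0, 1_000_000, numPartitions=8)          # 8 partitions
counts = df.groupBy((df.id % 10).alias("key")).count()   # wide dependency

# Stage 1 pre-aggregates the 8 input partitions (8 tasks); stage 2 works on
# the shuffled data, with one task per shuffle partition.
counts.collect()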
Now, let's assume I have a driver and four executors. Each executor will have one JVM process, but I assigned 4 CPU cores to each executor. So, my executor JVM can create four parallel threads, and that is the slot capacity of my executor.
So, each executor can run four parallel threads, and we call them executor slots. The driver knows how many slots are available on each executor, and it assigns tasks to fit into those slots.
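As a quick sketch of that arithmetic, using the assumed numbers from this example:

executors = 4            # number of executors in this example
cores_per_executor = 4   # spark.executor.cores
slots = executors * cores_per_executor
print(slots)             # 16 tasks can run in parallel across the cluster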
The last stage will send the result back to the driver over the
network. The driver will collect data from all the tasks and present it
to you.
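For example, a collect() like the sketch below (toy data, assumed names) is what triggers that final transfer to the driver:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("collect-demo").getOrCreate()

# collect() gathers the output of all tasks from the last stage into a
# local list on the driver.
rows = spark.range(10).collect()
print(rows)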
Apache Spark gives you two prominent interfaces to work with data.
1. Spark SQL
2. Dataframe API
Whether you use the Dataframe API or SQL, both go to the Spark SQL engine. To Spark, they are nothing but a Spark job represented as a logical plan.
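As a sketch, here is the same question asked through both interfaces (names and data are assumed for the example); both end up with the same engine:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-vs-df-demo").getOrCreate()
df = spark.createDataFrame([("a", 1), ("b", 2), ("a", 3)], ["key", "value"])
df.createOrReplaceTempView("events")

# Both of these become a logical plan for the Spark SQL engine.
spark.sql("SELECT key, SUM(value) AS total FROM events GROUP BY key").show()
df.groupBy("key").sum("value").show()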
The Spark SQL Engine will process your logical plan in four stages.
Catalyst Optimization
1. The Analysis stage parses your code for errors and incorrect names and creates a fully resolved logical plan. Your code is valid if it passes the Analysis phase.
2. The Logical Optimization phase applies standard rule-based optimizations to the logical plan.
3. In the Physical Planning phase, Spark SQL takes the logical plan and generates one or more physical plans. This phase applies cost-based optimization: the engine creates multiple plans, calculates each plan's cost, and selects the one with the lowest cost. At this stage, the engine may try different join algorithms to generate more than one physical plan.
4. The last stage is Code Generation. Your best physical plan goes into code generation, where the engine generates Java byte code for the RDD operations. That's why Spark is also said to act as a compiler: it uses state-of-the-art compiler technology for code generation to accelerate execution.
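You can see the outcome of the first three stages yourself with an extended explain: it prints the parsed, analyzed, and optimized logical plans along with the selected physical plan. A minimal sketch, with an assumed toy query:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("catalyst-demo").getOrCreate()
df = spark.range(100).selectExpr("id % 5 AS key").groupBy("key").count()

# Prints the parsed, analyzed, and optimized logical plans and the
# physical plan chosen by the Spark SQL engine.
df.explain(True)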
So, a Spark driver asks for the driver container memory using two configurations.
1. spark.driver.memory
2. spark.driver.memoryOverhead
So, let's assume you asked for spark.driver.memory as 1 GB and left spark.driver.memoryOverhead at its default of 0.10.
The YARN RM will allocate 1 GB of memory for the driver JVM, plus 10% of the requested memory or 384 MB, whichever is higher, for container overhead. In this example the overhead works out to 384 MB, so the driver container comes to roughly 1 GB + 384 MB.
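Here is a sketch of that example in code, using the numbers above. In practice you usually pass these to spark-submit (for example with --driver-memory), because the driver JVM is already running by the time your script builds a SparkSession; the values are purely illustrative.

from pyspark.sql import SparkSession

# Driver container = 1 GB JVM heap + overhead,
# where overhead = max(10% of 1024 MB, 384 MB) = 384 MB, so about 1408 MB.
spark = (SparkSession.builder
         .appName("driver-memory-demo")
         .config("spark.driver.memory", "1g")
         .config("spark.driver.memoryOverhead", "384m")
         .getOrCreate())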
So, the driver will again request executor containers from YARN. The YARN RM will allocate a bunch of executor containers.
So, a Spark driver will ask for executor container memory using four configurations, covering the following four parts.
1. Overhead Memory
2. Heap Memory
3. Off Heap Memory
4. PySpark Memory
So, the driver will look at all these configurations to calculate your
memory requirement and sum it up.
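For example, here is a sketch with assumed values, using the four configurations behind those parts (overhead, heap, off-heap, and PySpark memory respectively):

from pyspark.sql import SparkSession

# Illustrative values only. YARN is asked for the sum per executor container:
# 8192 MB heap + 800 MB overhead + 1024 MB off-heap + 500 MB PySpark memory,
# which is roughly 10.3 GB.
spark = (SparkSession.builder
         .appName("executor-memory-demo")
         .config("spark.executor.memoryOverhead", "800m")    # overhead memory
         .config("spark.executor.memory", "8g")              # JVM heap memory
         .config("spark.memory.offHeap.enabled", "true")
         .config("spark.memory.offHeap.size", "1g")          # off heap memory
         .config("spark.executor.pyspark.memory", "500m")    # PySpark memory
         .getOrCreate())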
While using the YARN RM, you should also look at the following configurations, because YARN cannot give you a container bigger than these limits allow.
1. yarn.scheduler.maximum-allocation-mb
2. yarn.nodemanager.resource.memory-mb
You do not need to worry about PySpark memory if you write your
Spark application in Java or Scala. But if you are using PySpark, this
question becomes critical.
You have a container, and the container has got some memory. Let's focus on the JVM heap memory in this part. The heap memory is further broken down into three parts.
1. Reserved Memory
2. Spark Memory
3. User Memory
So, let's assume I got 8 GB for the JVM heap. This 8 GB is divided into three parts. Spark will reserve 300 MB for itself. That's fixed, and the Spark engine itself uses it.
The next part is the Spark memory pool, controlled by the spark.memory.fraction configuration, and the default value is 60%. So, in this example, the Spark memory pool translates to (8000 MB - 300 MB) x 60% = 4620 MB, and the remaining 40%, or 3080 MB, is left as user memory.
1. The Reserved Pool is set aside for the Spark engine itself. You cannot use it.
2. The Spark Memory Pool is where all your data frames and data frame operations live. You can increase it from 60% to 70% or even more if you are not using UDFs, custom data structures, and RDD operations. But you cannot reduce the user memory to zero or cut it too much, because you will need it for metadata and other internal things.
The Spark Memory Pool is further broken down into two sub-pools.
• Storage Memory
• Executor Memory
The default break-up for the sub-pools is 50% each, but you can change it using the spark.memory.storageFraction configuration. We use the Storage Pool for caching data frames, and the Executor Pool to perform data frame computations.
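Here is a small sketch of that arithmetic, using the 8 GB example and the default fractions:

heap_mb = 8000          # spark.executor.memory for this example
reserved_mb = 300       # fixed reserve for the Spark engine
memory_fraction = 0.6   # spark.memory.fraction (default)
storage_fraction = 0.5  # spark.memory.storageFraction (default)

spark_pool = (heap_mb - reserved_mb) * memory_fraction        # 4620 MB
user_pool = (heap_mb - reserved_mb) * (1 - memory_fraction)   # 3080 MB
storage_pool = spark_pool * storage_fraction                  # 2310 MB
executor_pool = spark_pool * (1 - storage_fraction)           # 2310 MB
print(spark_pool, user_pool, storage_pool, executor_pool)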
3. The User Memory Pool is used for non-data-frame operations.
• All the RDD information and the RDD operations are performed in user memory.
• Data frame operations do not use the user memory, even though a data frame is internally translated and compiled into RDDs. You will be using user memory only if you apply RDD operations directly in your code.
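To make the distinction concrete, here is a small sketch (toy data, assumed names): caching a data frame fills the storage pool, while dropping to the RDD API is what consumes user memory.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("memory-pools-demo").getOrCreate()
df = spark.range(1_000_000)

# Caching places the data frame's blocks in the storage pool.
df.cache().count()

# Applying RDD operations directly is what uses the user memory pool.
squares = df.rdd.map(lambda row: row.id * row.id)
print(squares.take(3))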