
spark runtime architecture overview – tutorial 13

November, 2017 · adarsh (http://timepasstechies.com/author/adarshgorur/)
In distributed mode, Spark uses a master/slave architecture with one central coordinator
and many distributed workers. The central coordinator is called the driver. The driver
communicates with a potentially large number of distributed workers called executors.
The driver runs in its own Java process, and each executor is a separate Java process. A
driver and its executors are together termed a Spark application. A Spark application is
launched on a set of machines using an external service called a cluster manager.

Driver
The driver is the process where the main() method of your program runs. It is the process
running the user code that creates a SparkContext, creates RDDs, and performs
transformations and actions. The Spark driver is responsible for converting a user
program into units of physical execution called tasks.
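
As a rough illustration, here is a minimal Scala driver; the class name, paths, and
word-count logic are hypothetical, but the SparkContext, transformation, and action
calls are the standard RDD API:

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // Creating the SparkContext makes this JVM process the driver.
    val conf = new SparkConf().setAppName("Word Count")
    val sc = new SparkContext(conf)

    // Transformations (flatMap, map, reduceByKey) only build up the logical DAG.
    val counts = sc.textFile("hdfs:///data/input.txt")   // hypothetical input path
      .flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // An action forces the driver to turn the DAG into stages and tasks
    // and schedule them on the executors.
    counts.saveAsTextFile("hdfs:///data/output")         // hypothetical output path

    sc.stop()
  }
}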

A Spark program implicitly creates a logical directed acyclic graph (DAG) of operations.
When the driver runs, it converts this logical graph into a physical execution plan. Spark
performs several optimizations, such as pipelining map transformations together to
merge them, and converts the execution graph into a set of stages. Each stage, in turn,
consists of multiple tasks.
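
You can see these stage boundaries with RDD.toDebugString, which prints an RDD's
lineage. For the hypothetical word-count RDD above, flatMap and map are pipelined into
one stage, while reduceByKey, which requires a shuffle, starts another:

// Continuing the driver sketch above; the exact output format varies
// by Spark version, but indentation marks the stage boundaries.
println(counts.toDebugString)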

Given a physical execution plan, a Spark driver must coordinate the scheduling of
individual tasks on executors. When executors are started they register themselves with
the driver, so it has a complete view of the application’s executors at all times. Each
executor represents a process capable of running tasks and storing RDD data. The Spark
driver will look at the current set of executors and try to schedule each task in an
appropriate location, based on data placement.

Executors
Spark executors are worker processes responsible for running the individual tasks in a
given Spark job. Executors have two roles. First, they run the tasks that make up the
application and return results to the driver. Second, they provide in-memory storage for
RDDs that are cached by user programs, through a service called the Block Manager that
lives within each executor. Because RDDs are cached directly inside of executors, tasks
can run alongside the cached data.
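
As a small sketch of that second role (the path and filter below are hypothetical),
persisting an RDD asks each executor's Block Manager to keep the partitions it
computes, so later actions can run their tasks next to the cached blocks:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object CachingDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("Caching Demo"))

    val errors = sc.textFile("hdfs:///data/app.log")   // hypothetical path
      .filter(line => line.contains("ERROR"))

    // persist() stores the computed partitions in each executor's
    // Block Manager; MEMORY_AND_DISK spills to disk when memory fills up.
    errors.persist(StorageLevel.MEMORY_AND_DISK)

    println(errors.count())  // first action computes and caches the partitions
    println(errors.count())  // second action is served from the cached blocks

    sc.stop()
  }
}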

Cluster Manager
Spark depends on a cluster manager to launch executors and, in certain cases, to launch
the driver. The cluster manager is a pluggable component in Spark. This allows Spark to
run on top of different external managers, such as YARN and Mesos, as well as its built-in
Standalone cluster manager.
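
For example, the same application JAR (hypothetical names below) can be pointed at any
of the three managers purely through the --master flag of spark-submit, covered in the
next sections:

spark-submit --master spark://hostname:7077 --class com.example.App app.jar   # Standalone
spark-submit --master mesos://hostname:5050 --class com.example.App app.jar   # Mesos
spark-submit --master yarn --class com.example.App app.jar                    # YARN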

Let's walk through the exact steps that occur when you run a Spark application on a
cluster:

1. The user submits an application using spark-submit.

2. spark-submit launches the driver program and invokes the main() method specified by
the user.

3. The driver program contacts the cluster manager to ask for resources to launch
executors.

4. The cluster manager launches executors on behalf of the driver program.

5. The driver process runs through the user application. Based on the RDD actions and
transformations in the program, the driver sends work to executors in the form of tasks.

6. Tasks are run on executor processes to compute and save results.

7. If the driver’s main() method exits or it calls SparkContext.stop(), it will terminate the
executors and release resources from the cluster manager.
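
Here is a minimal sketch (hypothetical name and data) annotating where each of these
steps happens from the driver's point of view:

import org.apache.spark.{SparkConf, SparkContext}

object LifecycleDemo {
  def main(args: Array[String]): Unit = {   // step 2: spark-submit invokes main()
    // steps 3-4: creating the SparkContext asks the cluster manager
    // to launch executors on our behalf
    val sc = new SparkContext(new SparkConf().setAppName("Lifecycle Demo"))

    // steps 5-6: the sum() action sends tasks to the executors,
    // which compute partial results and return them to the driver
    val total = sc.parallelize(1 to 1000).sum()
    println(s"total = $total")

    // step 7: stop() terminates the executors and releases
    // the resources held from the cluster manager
    sc.stop()
  }
}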

spark-submit
Spark provides a single tool for submitting jobs across all cluster managers, called spark-
submit.

Below is an example:

spark-submit --master spark://hostname:7077 --deploy-mode cluster \
  --class com.consumer.SparkDstreamConsumer --name "Call Quality Test" \
  --total-executor-cores 5 --executor-memory 5g \
  spark-0.0.1-SNAPSHOT.jar

When spark-submit is called with nothing but the name of a script or JAR, it simply runs
the supplied Spark program locally. The --master flag specifies a cluster URL to connect
to; in this case, the spark:// URL means a cluster using Spark's Standalone mode. The
possible values are:

spark://host:port    connect to a Spark Standalone cluster at the specified port (7077 by default)
mesos://host:port    connect to a Mesos cluster at the specified port (5050 by default)
yarn                 connect to a YARN cluster; the cluster location is found from HADOOP_CONF_DIR
local                run locally with a single core
local[N]             run locally with N cores
local[*]             run locally with as many cores as the machine has

Common flags used with spark-submit:

--master             the cluster manager to connect to (see the values above)
--deploy-mode        whether to launch the driver locally ("client") or on a machine
                     inside the cluster ("cluster"); the default is client mode
--class              the main class of the application, for Java/Scala programs
--name               a human-readable name for the application, shown in Spark's web UI
--jars               JAR files to place on the classpath of the driver and executors
--files              files to be placed in the working directory of the application
--py-files           files to add to the PYTHONPATH of Python applications
--executor-memory    the amount of memory to use per executor, e.g. "512m" or "2g"
--driver-memory      the amount of memory to use for the driver process
