Spark Source Code Series (5): Spark Submit Job Submission

JKerving

License: CC 4.0 BY-SA

Article link: https://blog.csdn.net/JKerving/article/details/107708675

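The previous articles covered the DAGScheduler and analyzed how a Spark job is divided into stages after submission. I did not originally plan to organize this source-code series around the overall job submission flow, which shows that I still lack blogging experience. Starting with this article, we begin at Spark job submission and study how Spark works internally and how a Spark job runs from start to finish.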

A Spark application runs on the cluster as a set of independent processes. The overall execution flow is as follows:

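The user submits a job. Once the SparkContext object has been initialized, SparkContext coordinates the execution of the Spark job on the cluster
SparkContext connects to the cluster manager (Cluster Manager), requests resources, and registers the Application. In production, the cluster manager is usually YARN. The cluster manager is responsible for allocating resources among applications
After connecting to the Cluster Manager, Executors are created on the Worker nodes of the cluster according to the resources that were granted
Once created, each Executor reports back to the Driver
During SparkContext initialization, the DAGScheduler is created and started. It splits the submitted job into Stages and finally converts them into Tasks. After the preferred compute location of each Task has been determined, the Tasks are sent to the designated Executors for execution
The Task results are returned to the Driver, the Spark job finishes its computation, and the Spark job is then shut down.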

That covers the overall flow of a Spark job at a high level. Below we start from Spark Submit and work step by step through the job submission process.

Client-Side Job Submission

Everything starts with the client submitting a user-written Spark program. This is done with the spark-submit script located in the $SPARK_HOME/bin directory.

./bin/spark-submit \
 --class <main-class> \
 --master <master-url> \
 --deploy-mode <deploy-mode> \
 --conf <key>=<value> \
 ... # other options
 <application-jar> \
 [application-arguments]

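--class: the entry point of the application (the main class)
--master: the master URL of the cluster (for example, spark://10.142.97.4:7077)
--deploy-mode: whether to deploy the user's Driver on a Worker node inside the cluster (cluster) or run it locally as an external client (client). In production we usually choose cluster mode, with YARN as the resource manager
--conf: a Spark configuration property in key=value form
application-jar: the path to the JAR containing the user's program
application-arguments: the arguments required by the user's program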

A more concrete example:

./bin/spark-submit \
 --class org.apache.spark.examples.SparkPi \
 --master spark://10.142.97.4:7077 \
 --deploy-mode cluster \
 --supervise \
 --executor-memory 2G \
 --total-executor-cores 5 \
 /path/examples.jar \
 <program arguments>

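--class specifies the entry point of the program
--master specifies the master URL
--deploy-mode sets the deployment mode to cluster
--supervise restarts the driver automatically if the application fails
--executor-memory 2G gives each executor 2G of memory
--total-executor-cores 5 sets the total number of CPU cores across all executors to 5
/path/examples.jar is the JAR containing the program
<program arguments> stands for the arguments required by the program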
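
The example above targets a standalone master, while the option notes mention that production jobs usually run in cluster mode on YARN. As a rough sketch of what that might look like, the snippet below builds an equivalent argument list for YARN and only prints the resulting command instead of running it. The resource values and the jar path are placeholders carried over from the example above, and --supervise is left out on the assumption that it applies to standalone cluster mode rather than YARN.

```shell
#!/usr/bin/env bash
# Dry-run sketch: assemble spark-submit arguments for YARN cluster mode.
# All values below are illustrative placeholders, not recommendations.
submit_args=(
  --class org.apache.spark.examples.SparkPi
  --master yarn              # let YARN act as the cluster manager
  --deploy-mode cluster      # run the driver inside the cluster
  --executor-memory 2G       # memory per executor
  --executor-cores 1         # CPU cores per executor
  --num-executors 5          # number of executors to request
  /path/examples.jar         # application jar (placeholder path)
)

# Print the command rather than executing it.
cmd="./bin/spark-submit ${submit_args[*]}"
echo "$cmd"
```

To actually submit, the array would be passed on unchanged, for example `./bin/spark-submit "${submit_args[@]}"`, so that each option stays a separate argument.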

These arguments are passed to the spark-submit script for execution. Let's look at the contents of spark/bin/spark-submit:

if [ -z "${SPARK_HOME}" ]; then
  source "$(dirname "$0")"/find-spark-home
fi

# disable randomized hash for string in Python 3.3+
export PYTHONHASHSEED=0

exec "${SPARK_HOME}"/bin/spark-class org.apache.spark.deploy.SparkSubmit "$@"

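First, the script checks whether the SPARK_HOME variable is empty. If it is, the then branch sources the find-spark-home script located in the same directory as spark-submit, which sets SPARK_HOME
Finally, the script uses exec to run "${SPARK_HOME}"/bin/spark-class with the class org.apache.spark.deploy.SparkSubmit; the trailing "$@" passes along all arguments given to the script. In other words, the job is actually submitted by calling the spark-class script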
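
To make the role of "$@" more concrete, here is a small self-contained sketch. The two functions are hypothetical stand-ins for spark-submit and spark-class (they are not part of Spark); the point is only to show that quoting "$@" forwards every argument unchanged, including one that contains a space.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for spark-class: print each received argument on its own line.
fake_spark_class() {
  printf '[%s]\n' "$@"
}

# Hypothetical stand-in for spark-submit: prepend the main class, then forward
# every original argument via "$@" so that word boundaries are preserved.
fake_spark_submit() {
  fake_spark_class org.apache.spark.deploy.SparkSubmit "$@"
}

output="$(fake_spark_submit --class Demo --conf "spark.app.name=my app" app.jar)"
echo "$output"
```

Because "$@" is quoted, the value `spark.app.name=my app` arrives as a single argument. With an unquoted $@ it would be split into two words at the space.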

Next, look at the contents of the /spark/bin/spark-class script:

# If SPARK_HOME is empty (-z), source find-spark-home to set it
if [ -z "${SPARK_HOME}" ]; then
  source "$(dirname "$0")"/find-spark-home
fi
# Source load-spark-env.sh, which loads the variables defined in spark-env.sh into the environment
. "${SPARK_HOME}"/bin/load-spark-env.sh

# Pick the Java launcher: prefer JAVA_HOME, otherwise fall back to java on the PATH
# Find the java binary
if [ -n "${JAVA_HOME}" ]; then
  RUNNER="${JAVA_HOME}/bin/java"
else
  if [ "$(command -v java)" ]; then
    RUNNER="java"
  else
    echo "JAVA_HOME is not set" >&2
    exit 1
  fi
fi
# Locate the directory containing the Spark jars used for the launch classpath
# Find Spark jars.
if [ -d "${SPARK_HOME}/jars" ]; then
  SPARK_JARS_DIR="${SPARK_HOME}/jars"
else
  SPARK_JARS_DIR="${SPARK_HOME}/assembly/target/scala-$SPARK_SCALA_VERSION/jars"
fi

if [ ! -d "$SPARK_JARS_DIR" ] && [ -z "$SPARK_TESTING$SPARK_SQL_TESTING" ]; then
  echo "Failed to find Spark jars directory ($SPARK_JARS_DIR)." 1>&2
  echo "You need to build Spark with the target \"package\" before running this program." 1>&2
  exit 1
else
  LAUNCH_CLASSPATH="$SPARK_JARS_DIR/*"
fi

# Add the launcher build dir to the classpath if requested.
if [ -n "$SPARK_PREPEND_CLASSES" ]; then
  LAUNCH_CLASSPATH="${SPARK_HOME}/launcher/target/scala-$SPARK_SCALA_VERSION/classes:$LAUNCH_CLASSPATH"
fi

# For tests
if [[ -n "$SPARK_TESTING" ]]; then
  unset YARN_CONF_DIR
  unset HADOOP_CONF_DIR
fi

# The launcher library will print arguments separated