Spark Startup Parameters and Tuning Notes

# Spark Launch Command Parameters in Detail

[toc]

spark-submit \
--master yarn \
--deploy-mode client \
--class com.application.AttributesCreateLabelsApplication \
--jars $(echo /data/flink-job/jast-test/lib/*.jar | tr ' ' ',') \
--num-executors 2 \
--executor-cores 2 \
--executor-memory 4g \
--driver-memory 2g \
--files ./env.properties \
--conf spark.kryoserializer.buffer.max=2000 \
--conf spark.rpc.message.maxSize=500 \
--conf spark.sql.shuffle.partitions=100 \
--conf spark.default.parallelism=100 \
--conf spark.storage.memoryFraction=0.3 \
--conf spark.shuffle.memoryFraction=0.7 \
--conf spark.shuffle.safetyFraction=0.8 \
--conf spark.yarn.maxAppAttempts=5 \
--conf spark.yarn.am.attemptFailuresValidityInterval=1h \
--conf spark.yarn.max.executor.failures=16 \
--conf spark.yarn.executor.failuresValidityInterval=1h \
--conf spark.task.maxFailures=8 \
--conf spark.shuffle.spill=true \
--conf spark.streaming.kafka.maxRatePerPartition=10000 \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
--conf spark.executor.extraJavaOptions="-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:-UseGCOverheadLimit"  \
/data/flink-job/jast-test/offline-1.0-SNAPSHOT.jar


# Parameters

# num-executors

The total number of executor processes used to run the job.

Recommendation: 50 to 100 executors per job is usually appropriate.

# executor-memory

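The memory allocated to each executor process. num-executors * executor-memory is the total memory the job requests (try not to exceed 1/3 to 1/2 of the cluster's total memory).

Recommendation: 4G to 8G is usually appropriate.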

# executor-cores

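The number of CPU cores for each executor process. This parameter determines how many task threads each executor can run in parallel. num-executors * executor-cores is the total number of CPU cores the job requests (do not exceed 1/3 to 1/2 of the cluster's total CPU cores).

Recommendation: 2 to 4 cores is usually appropriate.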
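
The two products described for executor-memory and executor-cores can be checked before submitting a job. The following is a minimal sketch, not part of Spark: the function names and the cluster capacities (CLUSTER_MEMORY_GB, CLUSTER_CORES) are hypothetical, and the check uses the upper bound of 1/2 mentioned above.

```shell
# Total memory (GB) requested by the job: num-executors * executor-memory.
total_memory_gb() {
    echo $(( $1 * $2 ))
}

# Total CPU cores requested by the job: num-executors * executor-cores.
total_cores() {
    echo $(( $1 * $2 ))
}

# Succeeds (exit status 0) when the request is at most half of the capacity.
# Arguments: requested amount, cluster capacity.
within_half() {
    [ $(( $1 * 2 )) -le "$2" ]
}

# Executor settings from the launch command: 2 executors, 4 GB and 2 cores each.
# The cluster capacities below are hypothetical values for illustration.
CLUSTER_MEMORY_GB=64
CLUSTER_CORES=32

mem=$(total_memory_gb 2 4)
cores=$(total_cores 2 2)
echo "memory: ${mem} GB, cores: ${cores}"

if within_half "$mem" "$CLUSTER_MEMORY_GB" && within_half "$cores" "$CLUSTER_CORES"; then
    echo "request is within half of the cluster"
else
    echo "request exceeds half of the cluster"
fi
```

With these assumed capacities the job requests 8 GB and 4 cores, well under the limit.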

# driver-memory(AM)

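The memory allocated to the Driver process.

Recommendation: this usually does not need to be set, since 1G is generally enough. If the job uses the collect operator to pull an entire RDD to the Driver, make sure this value is large enough, otherwise the Driver fails with an OOM (out of memory) error.
This memory is used when results are dumped to the local machine. Writing to HDFS does not use it, so it rarely needs changing.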

# spark.default.parallelism

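The default number of tasks per stage.

Recommendation: 500 to 1000 is usually appropriate. By default one HDFS block corresponds to one task, and Spark's default is on the low side, so cluster resources are not fully used.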
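
One way to arrive at a concrete number is the rule of thumb from the Spark tuning guide of 2 to 3 tasks per CPU core. The sketch below applies it; suggest_parallelism is a hypothetical helper, and the executor counts are example values taken from the recommendations above rather than from a real cluster.

```shell
# Suggested parallelism: num-executors * executor-cores * tasks per core.
# Arguments: number of executors, cores per executor, tasks per core.
suggest_parallelism() {
    echo $(( $1 * $2 * $3 ))
}

# 50 executors with 4 cores each, 3 tasks per core.
parallelism=$(suggest_parallelism 50 4 3)
echo "--conf spark.default.parallelism=${parallelism}"
```

For 50 executors with 4 cores each this gives 600, which falls inside the 500 to 1000 range recommended above.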

# spark.storage.memoryFraction

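The fraction of executor memory that persisted RDD data may occupy. The default is 0.6, meaning 60% of executor memory can hold persisted RDD data.

Recommendation: raise it if the job persists a lot of data. Data that does not fit in this region causes frequent GC and slows the job down.

Note that this parameter belongs to the legacy memory model: since Spark 1.6 it only takes effect when spark.memory.useLegacyMode=true, and the legacy mode was removed in Spark 3.0.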

# spark.shuffle.memoryFraction

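The fraction of executor memory used for aggregation during shuffle. The default is 0.2.

Recommendation: if the job persists little data but shuffles heavily, lower the storage fraction and raise the shuffle fraction.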
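
To get a feel for what these fractions mean in megabytes, the sketch below multiplies the executor heap by a fraction and its safety fraction, which is how the legacy memory model sizes each region. region_mb is a hypothetical helper. The storage safety fraction of 0.9 is the legacy default and is an assumption here, since the launch command does not set it. The real usable heap is somewhat smaller than --executor-memory, so treat the numbers as approximate.

```shell
# Approximate size (MB) of a legacy memory region:
# heap * memoryFraction * safetyFraction, truncated to a whole number.
# Arguments: heap size in MB, memory fraction, safety fraction.
region_mb() {
    awk -v heap="$1" -v frac="$2" -v safety="$3" \
        'BEGIN { printf "%d\n", heap * frac * safety }'
}

HEAP_MB=4096   # --executor-memory 4g from the launch command

# spark.storage.memoryFraction=0.3, assumed legacy safety fraction 0.9
echo "storage region: $(region_mb "$HEAP_MB" 0.3 0.9) MB"

# spark.shuffle.memoryFraction=0.7, spark.shuffle.safetyFraction=0.8
echo "shuffle region: $(region_mb "$HEAP_MB" 0.7 0.8) MB"
```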

# total-executor-cores

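The total number of CPU cores used by all executors. In standalone mode the default is all available cores.

Note: because of this default, num-executors does not need to be set in standalone mode; it is derived as num-executors = total-executor-cores / executor-cores. total-executor-cores should be an integer multiple of executor-cores, otherwise the number of executors launched is rounded down.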
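
The rounding-down behaviour is ordinary integer division, as this small sketch shows. executors_from_total_cores is a hypothetical helper used only to illustrate the arithmetic.

```shell
# Number of executors in standalone mode:
# total-executor-cores / executor-cores, rounded down.
# Arguments: total executor cores, cores per executor.
executors_from_total_cores() {
    echo $(( $1 / $2 ))
}

# 10 total cores with 4 cores per executor: only 2 executors fit,
# and the remaining 2 cores are left unused.
executors_from_total_cores 10 4

# 12 total cores with 4 cores per executor divides evenly into 3 executors.
executors_from_total_cores 12 4
```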

# Failure Retry Settings

Reference:

https://blog.csdn.net/weixin_36378951/article/details/112199312

# spark.yarn.maxAppAttempts

--conf spark.yarn.maxAppAttempts

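**Note:** If spark.yarn.maxAppAttempts is larger than yarn.resourcemanager.am.max-attempts, it does not take effect. A smaller value works as expected.

The spark.yarn.maxAppAttempts property limits the number of application attempts at submission time, for example: spark-submit --conf spark.yarn.maxAppAttempts=1

We run Spark on YARN. YARN launches and manages the AM and allocates resources, but Spark has its own AM implementation, and once the executors are running the Driver controls the tasks. For retries, YARN is only responsible for retrying the AM. Executor retries are covered below.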

# spark.yarn.max.executor.failures

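In its ApplicationMaster implementation, Spark provides the parameter spark.yarn.max.executor.failures to limit the number of executor failures. Once executor failures reach this value, the whole Spark application fails.

The relevant Spark properties are defined as follows: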

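| Property | Default | Description |
| --- | --- | --- |
| spark.yarn.maxAppAttempts | The value of yarn.resourcemanager.am.max-attempts in the YARN configuration | The maximum number of attempts made to submit the application. It must be no greater than the global maximum in the YARN configuration. |
| spark.yarn.max.executor.failures | numExecutors * 2, with a minimum of 3, i.e. max(2 * numExecutors, 3) | The maximum number of executor failures before the application fails. |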
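
The default in the table, max(2 * numExecutors, 3), can be worked out for a few executor counts with the sketch below. default_max_executor_failures is a hypothetical helper that only reproduces the formula; it does not query Spark.

```shell
# Default for spark.yarn.max.executor.failures: max(2 * numExecutors, 3).
# Argument: number of executors.
default_max_executor_failures() {
    doubled=$(( $1 * 2 ))
    if [ "$doubled" -lt 3 ]; then
        echo 3
    else
        echo "$doubled"
    fi
}

for n in 1 2 50; do
    echo "numExecutors=${n} -> default max failures=$(default_max_executor_failures "$n")"
done
```

With a single executor the minimum of 3 applies. From two executors upward the doubled count is used.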

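In the YARN configuration we find:

| Property | Default | Description |
| --- | --- | --- |
| yarn.resourcemanager.am.max-attempts | 2 | The maximum number of application attempts. It is a global setting for all application masters. Each application master can specify its own maximum number of attempts via the API, but that number cannot exceed the global upper bound. If it does, the ResourceManager overrides it. The default is 2, to allow at least one retry of the AM. |

So by default spark.yarn.maxAppAttempts is 2. To disable the second attempt, set it to 1. Note that 0 is not a valid value, because the application must be submitted at least once.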
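
The interaction between the two settings amounts to taking the smaller value. The sketch below models that rule only; effective_app_attempts is a hypothetical helper, not a Spark or YARN command.

```shell
# Effective number of application attempts: the Spark setting,
# capped by the global YARN limit.
# Arguments: spark.yarn.maxAppAttempts, yarn.resourcemanager.am.max-attempts.
effective_app_attempts() {
    if [ "$1" -gt "$2" ]; then
        echo "$2"
    else
        echo "$1"
    fi
}

# Asking for 5 attempts while YARN allows 2: capped at 2.
effective_app_attempts 5 2

# Asking for 1 attempt while YARN allows 2: the smaller Spark value is used.
effective_app_attempts 1 2
```

This is why the YARN limit has to be raised first if more than 2 attempts are wanted.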

# spark.yarn.am.attemptFailuresValidityInterval 与 spark.yarn.executor.failuresValidityInterval

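If an application runs for days or weeks on a heavily used cluster without being restarted or redeployed, it can exhaust its attempts within a few hours. To avoid this, the attempt counter should be reset every hour.

Add the parameter --conf spark.yarn.am.attemptFailuresValidityInterval=1h for AM attempts, and --conf spark.yarn.executor.failuresValidityInterval=1h for executor failures.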
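
A simplified way to picture a validity interval is as a time window: only failures that happened within the last interval count against the limit. The sketch below models that idea with plain timestamps in seconds. count_recent_failures and the sample timestamps are hypothetical, and this is not how Spark or YARN implement the counter internally.

```shell
# Count failures that fall inside the validity window.
# Arguments: current time, window length, then any number of failure timestamps
# (all in seconds).
count_recent_failures() {
    now=$1
    window=$2
    shift 2
    count=0
    for ts in "$@"; do
        # A failure counts only if it is no older than the window.
        if [ $(( now - ts )) -le "$window" ]; then
            count=$(( count + 1 ))
        fi
    done
    echo "$count"
}

# Three failures at t=1000, t=7000 and t=9500, checked at t=10000
# with a 1h (3600 s) window: the failure at t=1000 has expired.
count_recent_failures 10000 3600 1000 7000 9500
```

Without a window all three failures would count, so a long-running job would creep toward its limit even if failures are rare.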

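# Recommended Configuration

Set yarn.resourcemanager.am.max-attempts to 5 in the YARN configuration.
Then launch with the following command, replacing {8 * num_executors} with eight times your executor count: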

spark-submit --master yarn --deploy-mode cluster \
--conf spark.yarn.maxAppAttempts=5 \
--conf spark.yarn.am.attemptFailuresValidityInterval=1h \
--conf spark.yarn.max.executor.failures={8 * num_executors} \
--conf spark.yarn.executor.failuresValidityInterval=1h \
--conf spark.task.maxFailures=8
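
Since the executor failure limit depends on the executor count, it can be convenient to compute it in a small wrapper instead of editing the number by hand. The following is a sketch under that assumption; retry_conf is a hypothetical helper, and the flag values are the ones recommended above.

```shell
# Print the retry-related --conf flags, one per line,
# with the executor failure limit set to 8 * number of executors.
# Argument: number of executors.
retry_conf() {
    max_failures=$(( 8 * $1 ))
    printf '%s\n' \
        "--conf spark.yarn.maxAppAttempts=5" \
        "--conf spark.yarn.am.attemptFailuresValidityInterval=1h" \
        "--conf spark.yarn.max.executor.failures=${max_failures}" \
        "--conf spark.yarn.executor.failuresValidityInterval=1h" \
        "--conf spark.task.maxFailures=8"
}

# Flags for a job with 2 executors.
retry_conf 2
```

The output can be spliced into a launch command with unquoted command substitution, for example spark-submit --master yarn --deploy-mode cluster $(retry_conf 2) followed by the remaining options and the application JAR. This works because none of the values contain spaces.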


Last updated: 2023/05/11, 16:21:56
