How to resolve a Spark job that hangs after running for a while

jsl (Senior Member) | Posted 2016-10-17 11:36:20

Problem Description

In the big data project at the XX site, a spark-streaming job hung during execution. Observation showed that the problem occurred while the job was processing the data for the 2016-06-18 10:16 time window.

In the driver-side logs we can see that the driver started 8 tasks; 7 of them completed successfully, while the remaining task shows no execution status at all. In other words, one task went missing while the Spark job was running.

FI version: FusionInsight V100R002C50SPC200

Log information:

1. Logs showing the 8 tasks being started:

2016-06-18 10:16,167 | INFO  | [sparkDriver-akka.actor.default-dispatcher-16] | Starting task 0.0 in stage 293749.0 (TID 1356848, scdxdsjkafka2, PROCESS_LOCAL, 1165 bytes) | org.apache.spark.Logging$class.logInfo(Logging.scala:59)

2016-06-18 10:16,167 | INFO  | [sparkDriver-akka.actor.default-dispatcher-16] | Starting task 1.0 in stage 293749.0 (TID 1356849, scdxdsjkafka1, PROCESS_LOCAL, 1165 bytes) | org.apache.spark.Logging$class.logInfo(Logging.scala:59)

2016-06-18 10:16,168 | INFO  | [sparkDriver-akka.actor.default-dispatcher-16] | Starting task 2.0 in stage 293749.0 (TID 1356850, scdxdsjkafka1, PROCESS_LOCAL, 1165 bytes) | org.apache.spark.Logging$class.logInfo(Logging.scala:59)

2016-06-18 10:16,169 | INFO  | [sparkDriver-akka.actor.default-dispatcher-16] | Starting task 3.0 in stage 293749.0 (TID 1356851, scdxdsjkafka1, PROCESS_LOCAL, 1165 bytes) | org.apache.spark.Logging$class.logInfo(Logging.scala:59)

2016-06-18 10:16,169 | INFO  | [sparkDriver-akka.actor.default-dispatcher-16] | Starting task 4.0 in stage 293749.0 (TID 1356852, scdxdsjkafka3, PROCESS_LOCAL, 1165 bytes) | org.apache.spark.Logging$class.logInfo(Logging.scala:59)

2016-06-18 10:16,169 | INFO  | [sparkDriver-akka.actor.default-dispatcher-16] | Starting task 5.0 in stage 293749.0 (TID 1356853, scdxdsjkafka2, PROCESS_LOCAL, 1165 bytes) | org.apache.spark.Logging$class.logInfo(Logging.scala:59)

2016-06-18 10:16,170 | INFO  | [sparkDriver-akka.actor.default-dispatcher-16] | Starting task 6.0 in stage 293749.0 (TID 1356854, scdxdsjkafka2, PROCESS_LOCAL, 1165 bytes) | org.apache.spark.Logging$class.logInfo(Logging.scala:59)

2016-06-18 10:16,170 | INFO  | [sparkDriver-akka.actor.default-dispatcher-16] | Starting task 7.0 in stage 293749.0 (TID 1356855, scdxdsjkafka3, PROCESS_LOCAL, 1165 bytes) | org.apache.spark.Logging$class.logInfo(Logging.scala:59)

2. Logs showing 7 of the tasks finishing (TID 1356848 is missing):

2016-06-18 10:16,468 | INFO  | [task-result-getter-2] | Finished task 1.0 in stage 293749.0 (TID 1356849) in 301 ms on scdxdsjkafka1 (1/8) | org.apache.spark.Logging$class.logInfo(Logging.scala:59)

2016-06-18 10:16,472 | INFO  | [task-result-getter-3] | Finished task 5.0 in stage 293749.0 (TID 1356853) in 303 ms on scdxdsjkafka2 (2/8) | org.apache.spark.Logging$class.logInfo(Logging.scala:59)

2016-06-18 10:16,474 | INFO  | [task-result-getter-1] | Finished task 3.0 in stage 293749.0 (TID 1356851) in 306 ms on scdxdsjkafka1 (3/8) | org.apache.spark.Logging$class.logInfo(Logging.scala:59)

2016-06-18 10:16,480 | INFO  | [task-result-getter-0] | Finished task 7.0 in stage 293749.0 (TID 1356855) in 310 ms on scdxdsjkafka3 (4/8) | org.apache.spark.Logging$class.logInfo(Logging.scala:59)

2016-06-18 10:16,485 | INFO  | [task-result-getter-2] | Finished task 6.0 in stage 293749.0 (TID 1356854) in 315 ms on scdxdsjkafka2 (5/8) | org.apache.spark.Logging$class.logInfo(Logging.scala:59)

2016-06-18 10:16,494 | INFO  | [task-result-getter-3] | Finished task 4.0 in stage 293749.0 (TID 1356852) in 325 ms on scdxdsjkafka3 (6/8) | org.apache.spark.Logging$class.logInfo(Logging.scala:59)

2016-06-18 10:16,511 | INFO  | [task-result-getter-1] | Finished task 2.0 in stage 293749.0 (TID 1356850) in 343 ms on scdxdsjkafka1 (7/8) | org.apache.spark.Logging$class.logInfo(Logging.scala:59)

2016-06-18 10:16,027 | INFO  | [JobGenerator] | Added jobs for time 1466216220000 ms | org.apache.spark.Logging$class.logInfo(Logging.scala:59)

2016-06-18 10:16,035 | INFO  | [JobGenerator] | Added jobs for time 1466216230000 ms | org.apache.spark.Logging$class.logInfo(Logging.scala:59)

 

Screenshot when the problem occurred (one task blocked, and it stayed blocked for 6 hours):

[Screenshot: 20161017113556297001.png]

 

One task is missing from the task statistics:

http://hi3ms-image.huawei.com/hi/showimage-1430645815-408595-88bbc1a0e3abcd228e1a3bb4431202ce.jpg

 

Solution

1. Enable the speculation mechanism:
Add to spark-defaults.conf: spark.speculation true
Configure the three speculation-related parameters as follows (a combined config example follows this list):

- spark.speculation.interval 100: how often to check for speculatable tasks, in milliseconds;

- spark.speculation.quantile 0.75: the fraction of tasks that must finish before speculation is started;

- spark.speculation.multiplier 1.5: how many times slower than the median a task must be before a speculative copy is launched for it.
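
Put together, the additions to spark-defaults.conf would look roughly like the following. The interval/quantile/multiplier values are the ones suggested above for this case, not Spark defaults, and on this Spark version the interval value is interpreted in milliseconds:

spark.speculation              true
spark.speculation.interval     100
spark.speculation.quantile     0.75
spark.speculation.multiplier   1.5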

 

2. Set spark.streaming.concurrentJobs to 4.

 

With the speculation mechanism enabled, if a few tasks on one machine in the cluster are running much slower than the rest, the speculation mechanism launches copies of those tasks on other machines, and Spark takes the result of whichever copy finishes first as the final result.

spark.streaming.concurrentJobs controls the job concurrency of Spark Streaming; its default is 1, so once one job hangs, all subsequent jobs are blocked behind it. Setting it to 4 means that even if one job hangs, the following jobs can still be executed.
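
For reference, here is a minimal Scala sketch of applying the same settings programmatically through SparkConf when the streaming application is built. The application name is hypothetical, and the 10-second batch interval is only inferred from the "Added jobs for time" log entries above (which are 10 000 ms apart), not taken from the actual job:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Equivalent settings applied in application code instead of spark-defaults.conf.
val conf = new SparkConf()
  .setAppName("kafka-streaming-job")           // hypothetical application name
  .set("spark.speculation", "true")            // enable speculative execution
  .set("spark.speculation.interval", "100")    // check for slow tasks every 100 ms
  .set("spark.speculation.quantile", "0.75")   // speculate once 75% of tasks in a stage have finished
  .set("spark.speculation.multiplier", "1.5")  // a task 1.5x slower than the median gets a speculative copy
  .set("spark.streaming.concurrentJobs", "4")  // allow up to 4 streaming jobs to run concurrently

// Assumed 10-second batch interval, matching the JobGenerator entries in the log above.
val ssc = new StreamingContext(conf, Seconds(10))

Either way, the settings only take effect after the streaming application is restarted, since they are read when the application starts.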

From a technical point of view, setting these two parameters works around the problem of a single task hanging. Since the issue only appears on the live system after 3-4 days of running and is random, it cannot be reproduced and verified immediately; this workaround is the only option we can take for now.

 

 
