
FusionInsight (V100R002C60U10): NodeManager Instance Fault

Published: 2017-09-07  |  Views: 596  |  Downloads: 27  |  Author: xWX465745  |  Document ID: EKB1000860162

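Problem Description

One NodeManager instance of the YARN component reports a health status of Faulty, and it remains faulty after the instance is restarted.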



Troubleshooting Process

1. Collect the logs from the faulty NM node. They contain the following errors:

Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 1 missing files; e.g.: /srv/BigData/tmp/yarn-nm-recovery/yarn-nm-state/004845.sst
at org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.openDatabase(NMLeveldbStateStoreService.java:1017)
at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.initStorage(NMLeveldbStateStoreService.java:1004)
at org.apache.hadoop.yarn.server.nodemanager.recovery.NMStateStoreService.serviceInit(NMStateStoreService.java:200)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
... 5 more
2017-09-05 10:55:05,407 | INFO  | main | USER=omm OPERATION=nmShutdown TARGET=NodeManager RESULT=SUCCESS | NMAuditLogger.java:90
2017-09-05 10:55:05,408 | WARN  | main | USER=omm OPERATION=nmStartup TARGET=NodeManager RESULT=FAILURE DESCRIPTION=Exception occurred during startup | NMAuditLogger.java:288
2017-09-05 10:55:05,408 | FATAL | main | Error starting NodeManager | NodeManager.java:593
org.apache.hadoop.service.ServiceStateException: org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 1 missing files; e.g.: /srv/BigData/tmp/yarn-nm-recovery/yarn-nm-state/004845.sst
at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:211)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:254)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:587)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:638)
Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 1 missing files; e.g.: /srv/BigData/tmp/yarn-nm-recovery/yarn-nm-state/004845.sst
at org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.openDatabase(NMLeveldbStateStoreService.java:1017)
at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.initStorage(NMLeveldbStateStoreService.java:1004)
at org.apache.hadoop.yarn.server.nodemanager.recovery.NMStateStoreService.serviceInit(NMStateStoreService.java:200)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
... 5 more
2017-09-05 10:55:05,445 | INFO  | pool-1-thread-1 | SHUTDOWN_MSG: 
/************************************************************
SHUTDOWN_MSG: Shutting down NodeManager at BDP150500L05/83.24.65.5
************************************************************/ | LogAdapter.java:45

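2. Confirm that disk space usage and permissions are normal.
3. Delete the following two directories on the faulty node, then restart the NM instance. The instance returns to normal.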
rm -rf /srv/BigData/tmp/nm
rm -rf /srv/BigData/tmp/yarn-nm-recovery
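
The checks in steps 1 and 2 can be sketched as two small shell helpers. This is only an illustration: the function names are made up for this example, and the location of the NodeManager log is not given in this case, so the log path has to be supplied by whoever runs it.

```shell
# Hypothetical helper: print each distinct file that LevelDB reported as
# missing in the given log file (the log path is supplied by the caller).
missing_sst_paths() {
  grep -o 'Corruption: [0-9]* missing files; e\.g\.: [^ ]*' "$1" \
    | sed 's/.*e\.g\.: //' \
    | sort -u
}

# Hypothetical helper: show free space and ownership/mode for a directory,
# which is the information step 2 asks you to confirm.
check_dir_health() {
  dir=$1
  df -h "$dir"    # usage of the filesystem that holds the directory
  ls -ld "$dir"   # owner, group and permission bits of the directory itself
}
```

For example, `missing_sst_paths <path-to-NM-log>` lists the `.sst` files named in the corruption messages, and `check_dir_health /srv/BigData/tmp` shows whether the filesystem is full or the directory has unexpected ownership.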

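Root Cause

The stack trace shows the failure occurring while the NodeManager opens its LevelDB state store (NMLeveldbStateStoreService), which is used only when NodeManager recovery is turned on. The relevant YARN parameter is yarn.nodemanager.recovery.enabled, the NodeManager counterpart of yarn.resourcemanager.recovery.enabled. It controls whether the NodeManager restores its state after startup, and it is enabled in this environment. After the YARN service restarts, the NodeManager therefore tries to recover container state from the store under /srv/BigData/tmp/yarn-nm-recovery/yarn-nm-state.

For an unknown reason, a file belonging to that store (004845.sst in the log above) was missing. LevelDB reported the store as corrupt when opening it, so the NodeManager could not start.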

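Solution

Delete the following two directories on the faulty node, then restart the NM instance. Deleting the recovery directory discards the saved NodeManager state, so the instance starts with an empty store.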
rm -rf /srv/BigData/tmp/nm
rm -rf /srv/BigData/tmp/yarn-nm-recovery
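
If you prefer to keep the damaged data for later analysis, a more cautious variant is to rename the directories instead of deleting them. The sketch below is not part of the original fix; `set_aside` is a hypothetical helper name, and it assumes the NodeManager instance is stopped and that you run it as a user allowed to rename these directories.

```shell
# Hypothetical helper: rename a directory to <dir>.bak.<timestamp> instead of
# deleting it. Prints the new name; does nothing if the directory is absent.
set_aside() {
  dir=${1%/}                                  # drop a trailing slash, if any
  [ -d "$dir" ] || return 0                   # nothing to move
  backup="$dir.bak.$(date +%Y%m%d%H%M%S)"     # timestamped backup name
  mv -- "$dir" "$backup" && printf '%s\n' "$backup"
}
```

Calling `set_aside /srv/BigData/tmp/nm` and `set_aside /srv/BigData/tmp/yarn-nm-recovery` leaves the original paths absent, just as `rm -rf` does, while the old contents remain available under the printed backup names until you remove them.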