SSH Mutual Trust Failure Between Nodes on FusionInsight C60

Published: 2017-05-02
Problem Description

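After the cluster in an on-site test environment was restarted, none of the services on one control/data node could start, although logging in to the node's operating system worked normally.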

Troubleshooting Procedure
Because every service on the node was unavailable, a network problem on that node was suspected first.

1. Log in to the node and go to the nodeagent log directory /var/log/Bigdata/nodeagent/agentlog. The heartbeat log heartbeat_trace.log contains the following output:

2017-04-30 09:33:12,567 DEBUG [Thread-577] Send heartbeat request to Controller. com.huawei.bigdata.om.agent.services.CommunicationService$HeartbeatSender.run(CommunicationService.java:198)
2017-04-30 09:33:15,987 INFO  [NetworkMonitorThread] [--- 192.168.22.123 ping statistics ---, 10 packets transmitted, 10 received, 0% packet loss, time 4506ms, rtt min/avg/max/mdev = 0.105/0.178/0.423/0.092 ms] com.huawei.bigdata.om.agent.services.CommunicationService$NetworkMonitorThread.run(CommunicationService.java:313)
2017-04-30 09:33:52,581 ERROR [Thread-577] Continuing heartbeat sending, even Exception occured in the HeartbeatSender thread. com.huawei.bigdata.om.agent.services.CommunicationService$HeartbeatSender.run(CommunicationService.java:227)
java.lang.reflect.UndeclaredThrowableException
        at com.sun.proxy.$Proxy10.nodeHeartbeat(Unknown Source)
        at com.huawei.bigdata.om.agent.services.CommunicationService$HeartbeatSender.run(CommunicationService.java:199)
Caused by: java.net.ConnectException: Call From tac3/192.168.22.122 to 192.168.22.123:20025 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
        at sun.reflect.GeneratedConstructorAccessor124.newInstance(Unknown Source)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:801)
        at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:732)
        at org.apache.hadoop.ipc.Client.call(Client.java:1515)
        at org.apache.hadoop.ipc.Client.call(Client.java:1447)
        at org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:242)
        ... 2 more
Caused by: java.net.ConnectException: Connection refused
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
        at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
        at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
        at java.net.Socket.connect(Socket.java:589)
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:529)
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:495)
        at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:649)
        at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:747)
        at org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:394)
        at org.apache.hadoop.ipc.Client.getConnection(Client.java:1564)
        at org.apache.hadoop.ipc.Client.call(Client.java:1486)
        ... 4 more
2017-04-30 09:33:52,581 DEBUG [Thread-577] Send heartbeat request to Controller. com.huawei.bigdata.om.agent.services.CommunicationService$HeartbeatSender.run(CommunicationService.java:198)

Here 192.168.22.123 is the OMS floating IP address, and the error shows that the heartbeat connection to the OMS has been interrupted.
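When the log is long, it can help to list which remote endpoints the agent failed to reach. The following is a minimal sketch, not part of the product tooling: the function name refused_endpoints is hypothetical, and it assumes the Hadoop "Call From ... to ... failed on connection exception" message format seen in the log above.

```shell
# List the distinct host:port endpoints that refused a connection,
# as reported in a heartbeat_trace.log file.
# $1: path to the log file (an assumption; adjust to the local path)
refused_endpoints() {
    # Capture the token between " to " and " failed on connection exception"
    sed -n 's/.*Call From [^ ]* to \([^ ]*\) failed on connection exception.*/\1/p' "$1" | sort -u
}
```

For example, `refused_endpoints heartbeat_trace.log` run in the agentlog directory would list each refused endpoint once.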

2. From the faulty host, ping the OMS floating IP and the IP addresses of the active and standby OMS nodes. All are reachable.

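3. Test SSH mutual trust for the omm user. SSH from the faulty node to the active OMS node as omm succeeds without a password. The reverse direction, however, prompts for the omm password, which shows that the mutual trust has failed and is the cause of the problem.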

[omm@tac3 ~]$ ssh tac1
Warning: Permanently added 'tac1,192.168.22.120' (RSA) to the list of known hosts.
Last login: Fri Apr 28 09:09:04 2017 from 192.168.22.122
[omm@tac1 ~]$ ssh tac3
Warning: Permanently added 'tac3,192.168.22.122' (RSA) to the list of known hosts.
omm@tac3's password:
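The interactive test above can also be done without a password prompt, which is convenient when checking several nodes. This is a minimal sketch under stated assumptions: the function name check_ssh_trust is hypothetical, and it is meant to be run as the omm user. It relies only on the standard OpenSSH client options BatchMode and ConnectTimeout.

```shell
# Return 0 if key-based (passwordless) SSH to the target host works,
# non-zero otherwise.
# $1: target host name or IP, for example tac3
check_ssh_trust() {
    # BatchMode=yes disables password prompts, so a broken trust
    # relationship fails immediately instead of waiting for input.
    # ConnectTimeout=5 keeps an unreachable host from hanging the check.
    ssh -o BatchMode=yes -o ConnectTimeout=5 "$1" true
}
```

For example, `check_ssh_trust tac3 && echo "trust OK" || echo "trust FAILED"` run on tac1 would likely have reported the failure seen here. Note that in batch mode the check also fails if the target's host key is unknown and strict host key checking is not relaxed, so a failure is a prompt to investigate rather than proof of a key problem.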

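4. The on-site engineer confirmed that the SSH configuration file had been modified during testing. Inspection showed that the PubkeyAuthentication option in /etc/ssh/sshd_config had been manually set to no, which disables public key authentication. As a result, the omm user could not authenticate with its public key, and passwordless SSH to this node failed.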

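5. On healthy nodes this option is commented out by default, so the built-in OpenSSH default (yes) applies. After the line was manually restored to that state and the sshd service restarted, the problem was resolved.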

[root@tac3 agentlog]# cat /etc/ssh/sshd_config |grep PubkeyAuthentication
#PubkeyAuthentication yes
[root@tac3 agentlog]# /etc/init.d/sshd restart
Stopping sshd:                                             [  OK  ]
Starting sshd:                                             [  OK  ]
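To compare this setting across nodes, the value that a configuration file yields can be printed in one word. The following is a rough sketch, not a full sshd_config parser: the function name pubkey_auth_setting is hypothetical, and it assumes whitespace-separated keyword and value, ignores Include files, and stops at the first Match block because settings after it are conditional.

```shell
# Print the global PubkeyAuthentication value from an sshd_config file.
# Prints "yes" (the OpenSSH default) when the option is absent or commented out.
# $1: path to the file, for example /etc/ssh/sshd_config
pubkey_auth_setting() {
    awk '
        # Skip blank lines and comment lines
        NF == 0 || $1 ~ /^#/ { next }
        # Settings after a Match line are conditional; stop here
        tolower($1) == "match" { exit }
        # Keywords are case-insensitive; the first occurrence wins
        tolower($1) == "pubkeyauthentication" { print tolower($2); found = 1; exit }
        END { if (!found) print "yes" }
    ' "$1"
}
```

On the faulty node before the fix, `pubkey_auth_setting /etc/ssh/sshd_config` would have printed no. Where the installed OpenSSH supports it, running `sshd -T` as root prints the effective configuration and is the more reliable check.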

END