When this problem occurs, the MN fails to load other nodes. Loading a node normally takes about 30 minutes; if loading has not completed within 30 minutes, something is wrong. At that point, pick one of the nodes being loaded, connect a PC directly to the BMC network port of that server with a network cable, and log in using the default BMC IP. You can see that the node being loaded is stuck at obtaining a DHCP address. Because the node cannot get an IP address, it keeps rebooting.
For the first and second causes, performing the corresponding check and recovery operation solves the problem. For the third cause, there are two ways to solve it:
1. By default, the leases recorded in the lease file on the MN expire after 12 hours, and the expiry time of each lease can be seen in the file. Waiting 12 hours releases all occupied IP addresses automatically. The precondition is that the cluster servers occupying these IP addresses are powered off; otherwise they will renew their leases.
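The expiry times mentioned in method 1 are the `ends` lines of each lease block. The Python sketch below finds the latest of them, which is the earliest moment at which every recorded lease has expired. It assumes the ISC dhcpd lease format (`ends <weekday> YYYY/MM/DD HH:MM:SS;`, written in UTC) and skips other forms such as `ends never;`; the function names are hypothetical.

```python
import re
from datetime import datetime, timezone

# Start of a lease block, e.g. "lease 192.168.0.10 {"
LEASE_START = re.compile(r"^lease\s+(\S+)\s*\{")
# Expiry line, e.g. "ends 4 2024/01/04 20:00:00;" (weekday digit, then UTC time)
ENDS = re.compile(r"^ends\s+\d\s+(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2});")

def lease_end_times(text):
    """Map each leased IP to the UTC datetime of its last 'ends' line."""
    ends = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        m = LEASE_START.match(line)
        if m:
            current = m.group(1)
        elif line == "}":
            current = None
        elif current:
            e = ENDS.match(line)
            if e:
                # Later entries for the same IP overwrite earlier ones.
                ends[current] = datetime.strptime(
                    e.group(1), "%Y/%m/%d %H:%M:%S"
                ).replace(tzinfo=timezone.utc)
    return ends

def latest_release(text):
    """Time after which all recorded leases have expired, or None if none is found."""
    ends = lease_end_times(text)
    return max(ends.values()) if ends else None
```

On the MN, `latest_release(open("/var/lib/dhcp/db/dhcpd.releases").read())` would give the time to wait for, provided the nodes holding those leases stay powered off.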
2. To solve the problem as quickly as possible, use the following method:
First step: back up the /var/lib/dhcp/db/dhcpd.releases file, so that it can be restored if the problem is not solved and you need to roll back;
Second step: on the MN, run rm /var/lib/dhcp/db/dhcpd.releases to delete the lease file;
Third step: on the MN, run service dhcpd restart to restart the DHCP service;
Fourth step: on the MN, run makedhcp -a to regenerate the DHCP lease file.
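The four steps of method 2 can be scripted as in the minimal Python sketch below. It must run as root on the MN, and `reset_dhcp_leases` and the `.bak.<timestamp>` backup name are hypothetical choices. By default it only returns the planned actions (`dry_run=True`), so it can be reviewed before anything is deleted. Stock ISC dhcpd refuses to start without a lease database file; the sketch assumes, as the procedure above does, that the restart on the MN succeeds after the file is removed.

```python
import shutil
import subprocess
import time
from pathlib import Path

# Lease file path as given in this guide.
LEASE_FILE = Path("/var/lib/dhcp/db/dhcpd.releases")

def reset_dhcp_leases(lease_file=LEASE_FILE, dry_run=True):
    """Run the four recovery steps on the MN; with dry_run=True only return the plan."""
    lease_file = Path(lease_file)
    # Timestamped backup next to the original, so it can be restored later.
    backup = lease_file.with_name(f"{lease_file.name}.bak.{time.strftime('%Y%m%d%H%M%S')}")
    commands = [["service", "dhcpd", "restart"], ["makedhcp", "-a"]]
    plan = [f"copy {lease_file} -> {backup}", f"delete {lease_file}"]
    plan += [" ".join(cmd) for cmd in commands]
    if dry_run:
        return plan
    shutil.copy2(lease_file, backup)     # first step: back up the lease file
    lease_file.unlink()                  # second step: delete the lease file
    for cmd in commands:                 # third and fourth steps
        subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return plan
```

Calling `reset_dhcp_leases(dry_run=False)` performs the steps. To roll back, copy the backup file back to its original name and run service dhcpd restart again.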
There are three main reasons why a node being loaded cannot obtain an IP address from the MN's DHCP service:
1. The network between the MN and the node being loaded is disconnected.
2. The DHCP server on the MN has been stopped manually.
3. The IP addresses in the MN's DHCP address pool have been exhausted abnormally.
For the first cause: unplug the network cable that connects the eth0 port of the node being loaded to the switch, and use that cable to connect a PC to the same switch port. Configure the PC with an IP address in the same network segment as that switch port, then ping the MN's management IP from the PC. If the ping succeeds, the network is connected. Otherwise, the network configuration is faulty; check the corresponding switch configuration.
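The ping check from the PC can be wrapped as in the Python sketch below. The function names are hypothetical, and the `run` parameter only exists so the command runner can be replaced when testing. It assumes the system `ping` command (Windows takes the count as `-n`, Linux and macOS as `-c`) and treats exit status 0 as success, which is a reasonable first approximation but not a guarantee on every platform.

```python
import platform
import subprocess

def ping_cmd(ip, count=3):
    """Build a ping command for the local OS: Windows uses -n for the count, Linux/macOS use -c."""
    flag = "-n" if platform.system() == "Windows" else "-c"
    return ["ping", flag, str(count), ip]

def mn_reachable(mn_ip, run=subprocess.run):
    """Return True if pinging the MN management IP exits with status 0."""
    result = run(ping_cmd(mn_ip), stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return result.returncode == 0
```

For example, `mn_reachable("192.168.0.1")`, with a hypothetical management IP, returning False points to the switch configuration.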
For the second cause: log in to the MN with PuTTY and run service dhcpd status. If DHCP is stopped, run service dhcpd start to start the DHCP service. The MN runs monitoring software that supervises all of its processes; if the DHCP service has stopped, the monitoring software process has stopped as well. In that case, run sh /opt/galax/gcs/watchdog/watchdog.sh –start to start the monitoring software. Once it has started successfully, it supervises the DHCP service process and restarts it automatically whenever it stops.
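The DHCP part of this check can be sketched in Python as below. It relies on the LSB convention that an init script's `status` action exits with 0 when the service is running; `ensure_dhcpd` is a hypothetical name, and `run` is injectable for testing. Restarting the monitoring software is left out because this guide does not say how to check whether it is running.

```python
import subprocess

def ensure_dhcpd(run=subprocess.run):
    """Start dhcpd if 'service dhcpd status' reports it is not running.

    Returns True if a start was attempted, False if dhcpd was already running.
    """
    status = run(["service", "dhcpd", "status"],
                 stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    if status.returncode == 0:  # LSB: 0 means the service is running
        return False
    run(["service", "dhcpd", "start"], check=True)
    return True
```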
For the third cause: check the DHCP lease file on the MN. On the MN, run cd /var/lib/dhcp/db/; this directory contains the dhcpd.releases file, as shown in the figure below:
Check the contents of dhcpd.releases and find the lines that begin with lease.
If an IP address in the DHCP address pool is occupied, its binding state is active, as in the screen above. Examine the whole lease file: if all IP addresses in the address pool are occupied, the node being loaded cannot obtain an address via DHCP.
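Checking every lease by eye is tedious for a large pool, so the Python sketch below does the same inspection: it reads each `lease <ip> {` block, keeps the last `binding state` seen for each IP (ISC dhcpd appends updated entries, so the last one wins), and reports whether every address in the pool is active. The pool bounds are an assumption you must take from the MN's DHCP configuration, and the function names are hypothetical.

```python
import ipaddress
import re

# Start of a lease block, e.g. "lease 192.168.0.10 {"
LEASE_START = re.compile(r"^lease\s+(\S+)\s*\{")
# "binding state active;" but not "next binding state ..." (anchored at line start)
BINDING = re.compile(r"^binding state\s+(\w+);")

def lease_states(text):
    """Return {ip: binding_state}; later entries for the same IP override earlier ones."""
    states = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        m = LEASE_START.match(line)
        if m:
            current = m.group(1)
        elif line == "}":
            current = None
        elif current:
            b = BINDING.match(line)
            if b:
                states[current] = b.group(1)
    return states

def ip_range(first, last):
    """All IPv4 addresses from first to last, inclusive."""
    a = int(ipaddress.ip_address(first))
    b = int(ipaddress.ip_address(last))
    return [str(ipaddress.ip_address(i)) for i in range(a, b + 1)]

def pool_exhausted(text, pool_ips):
    """True if the pool is non-empty and every IP in it has an active lease."""
    active = {ip for ip, state in lease_states(text).items() if state == "active"}
    return bool(pool_ips) and all(ip in active for ip in pool_ips)
```

A typical call would be `pool_exhausted(open("/var/lib/dhcp/db/dhcpd.releases").read(), ip_range(first_ip, last_ip))`, where `first_ip` and `last_ip` are the hypothetical pool bounds.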
When field engineers build a GalaX environment, once all networks are connected, power off all nodes to be loaded, then power them on in batches rather than all at once. The number of nodes the MN can load simultaneously is half the size of the DHCP address pool configured during system installation. Keep the number of nodes loaded simultaneously below this value and keep the remaining nodes powered off; then the MN's DHCP address pool will not be exhausted.
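The batch-size rule above can be written down as in the Python sketch below. It reads "less than half the pool" strictly, so the largest allowed batch is the largest integer below `pool_size / 2`; the function names and node names are hypothetical. Powering the nodes on (manually or through their BMCs) is not shown.

```python
def max_concurrent_loads(pool_size):
    """Largest node count strictly below half the DHCP pool size."""
    return (pool_size + 1) // 2 - 1

def power_on_batches(nodes, pool_size):
    """Split nodes into batches no larger than max_concurrent_loads(pool_size)."""
    size = max_concurrent_loads(pool_size)
    if size < 1:
        raise ValueError("DHCP address pool is too small to load any node")
    return [nodes[i:i + size] for i in range(0, len(nodes), size)]
```

With a 100-address pool, at most 49 nodes are loaded at once. Each batch should finish loading (normally about 30 minutes, as noted above) before the next one is powered on.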