1. In the VMSUSER, virtual machine information is queried from the “eucalyptus” database on the ESC node. In the OMS Portal, however, the status of the virtual machines on the CNA nodes is displayed only after each cluster’s CRM node has reported it to the ESC. The symptom shows that the ESC database contains the virtual machine information, but the ESC has not received the status reported by the cluster CRM. The possible causes are as follows:
(1) The CRM node’s process is faulty and has not reported the virtual machine status to the ESC node;
(2) The ESC node’s process is faulty and has not received the virtual machine information reported by the CRM node;
(3) A network fault prevents the information from being transmitted normally.
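Cause (3) can be narrowed down by checking whether the ESC is reachable over TCP from the CRM node. The Python sketch below is illustrative only: it assumes the CRM reports to the ESC over TCP and uses port 8773 (the CLC port mentioned in step 4) as a default, which may not match the actual reporting channel. The host name passed in is a placeholder.

```python
import socket


def tcp_reachable(host: str, port: int = 8773, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    Assumption: the CRM-to-ESC reporting path uses TCP on this port.
    A refused connection, unreachable host, or timeout all return False.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers ConnectionRefusedError and socket.timeout
        return False
```

For example, running `tcp_reachable("esc-host")` on the CRM node (where `esc-host` is a hypothetical ESC address) returns False if the connection is refused or times out, which would point toward cause (3).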
2. Check the CRM node’s process: its status is normal.
3. Check the ESC node’s process: its status is abnormal.
Restart the ESC node’s process and check its status again: it quickly changes from normal to abnormal.
4. After analyzing the ESC logs in depth, the R&D engineers found that the CLC process was abnormal because the Watchdog was restarting it repeatedly. If the Watchdog detects any one of the following three conditions, it considers the CLC process abnormal and restarts it:
(1) The number of TCP connections on port 8773 is 0;
(2) The file “/opt/eucalyptus/restartFlag” exists;
(3) The “eucalyptus” certificate does not exist.
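The Watchdog rule above can be sketched in Python as follows. This is a minimal sketch, not the Watchdog’s actual code: it assumes the connection count is taken from the Linux `/proc/net/tcp` table (IPv4 only) and counts only ESTABLISHED sockets, and `CERT_PATH` is a hypothetical placeholder because the case does not say where the “eucalyptus” certificate is stored.

```python
import os

# From the case description.
RESTART_FLAG = "/opt/eucalyptus/restartFlag"
# Hypothetical placeholder: the real certificate location is not given.
CERT_PATH = "/path/to/eucalyptus/cert.pem"


def count_tcp_connections(port: int, proc_tcp_text: str) -> int:
    """Count ESTABLISHED IPv4 TCP sockets whose local port is `port`.

    `proc_tcp_text` is text in the Linux /proc/net/tcp format, where the
    local address is HEXIP:HEXPORT and state "01" means ESTABLISHED.
    """
    count = 0
    for line in proc_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue
        local_port = int(fields[1].split(":")[1], 16)
        if local_port == port and fields[3] == "01":
            count += 1
    return count


def clc_is_abnormal(conn_count: int, flag_exists: bool, has_cert: bool) -> bool:
    """Any single condition is enough for the Watchdog to restart the CLC."""
    return conn_count == 0 or flag_exists or not has_cert


def watchdog_should_restart(port: int = 8773,
                            flag_path: str = RESTART_FLAG,
                            cert_path: str = CERT_PATH,
                            proc_tcp: str = "/proc/net/tcp") -> bool:
    """Evaluate the three conditions against the local machine."""
    with open(proc_tcp) as f:
        conns = count_tcp_connections(port, f.read())
    return clc_is_abnormal(conns,
                           os.path.exists(flag_path),
                           os.path.exists(cert_path))
```

With this rule, the flag file alone makes `clc_is_abnormal` return True, even when port 8773 has connections and the certificate is present, which matches what step 5 finds.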
5. Check the ESC node: the “restartFlag” file exists in the “/opt/eucalyptus/” directory. When the CLC process starts, it launches a number of startup items; if any of them fails to start, the flag file “/opt/eucalyptus/restartFlag” is generated. Before the CLC process starts, the Watchdog deletes this flag file. However, because of a software bug in R002C00SPC100, the Watchdog deletes the flag file path used in R001C01 (“/opt/flag”) instead, so the restart flag file is never removed. As a result, every time the CLC process starts it is judged abnormal, and it is restarted repeatedly.
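The restart loop can be reproduced with a small Python simulation. It is a sketch under stated assumptions, not the product’s code: files are created in a temporary directory instead of `/opt`, the file `flag` in that directory stands in for the R001C01 path “/opt/flag”, only condition (2) of the health check is modeled, and a stale restart flag is assumed to be left over from an earlier failed startup item.

```python
import os
import tempfile


def start_clc(base_dir: str, startup_items_ok: list[bool], cleanup_path: str) -> bool:
    """Simulate one Watchdog-driven CLC start; return True if judged normal."""
    flag = os.path.join(base_dir, "restartFlag")
    # Before starting the CLC, the Watchdog deletes the flag file it knows about.
    if os.path.exists(cleanup_path):
        os.remove(cleanup_path)
    # If any startup item fails, the CLC start creates the restart flag.
    if not all(startup_items_ok):
        open(flag, "w").close()
    # Health check (condition 2 only): an existing flag means "abnormal".
    return not os.path.exists(flag)


def simulate(buggy: bool, cycles: int = 3) -> list[bool]:
    """Run several start cycles with either the correct or the buggy cleanup path."""
    with tempfile.TemporaryDirectory() as d:
        flag = os.path.join(d, "restartFlag")
        open(flag, "w").close()  # stale flag from an earlier failed startup item
        wrong = os.path.join(d, "flag")  # stand-in for the R001C01 path /opt/flag
        cleanup = wrong if buggy else flag
        return [start_clc(d, [True, True], cleanup) for _ in range(cycles)]
```

`simulate(buggy=False)` returns `[True, True, True]`: the Watchdog removes the stale flag and every start is judged normal. `simulate(buggy=True)` returns `[False, False, False]`: the Watchdog removes the wrong file, the stale flag survives, and every start is judged abnormal even though all startup items succeed.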