HUAWEI CLOUD Stack 6.5.0 Alarm and Event Reference 04


ALM-13 Pod Is Abnormal

Description

This alarm is generated when the health check for a pod fails.

Pod: In Kubernetes, a pod is the smallest unit of creation, scheduling, and deployment. A pod is a group of relatively tightly coupled containers that are always co-located and run in a shared application context. Containers within a pod share a network namespace, IP address, port space, and volumes.
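
As a quick, illustrative check of this shared context (podname, namespace, and the container names are placeholders, and the commands assume the container images include the hostname utility), containers of the same pod report the same IP address:

  kubectl exec podname -c container-a -n namespace -- hostname -i
  kubectl exec podname -c container-b -n namespace -- hostname -i
  # Both commands print the same pod IP, because all containers in a pod
  # share one network namespace.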

Attribute

Alarm ID: 13
Alarm Severity: Minor
Alarm Type: Environmental alarm

Parameters

kind: Resource type.
namespace: Name of the project to which the resource belongs.
name: Resource name.
uid: Unique ID of the resource.
OriginalEventTime: Event generation time.
EventSource: Name of the component that reports the event.
EventMessage: Supplementary information about the event.

Impact on the System

  • If the health check mode is set to Readiness, the system is not affected.
  • If the health check mode is set to Liveness and the number of check failures exceeds the threshold, the container or process is deleted and instances are re-created. As a result, services related to the application may be abnormal.
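
For example (a hedged illustration; podname and namespace are placeholders for the affected instance), the effect of repeated Liveness check failures is visible in the pod's restart count:

  kubectl get pod podname -n namespace
  # The RESTARTS column increases each time a Liveness failure causes the
  # container to be deleted and re-created.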

System Actions

  • The system keeps executing periodic health checks.
  • If the health check mode is set to Liveness and the number of check failures exceeds the threshold, the container or process is deleted and instances are re-created.

Possible Causes

  • The health check mode is incorrectly configured.
  • The health check fails because the Docker daemon on the node where the application resides is suspended.
  • The health check fails because the node is overloaded.

Procedure

Figure 17-1 shows the alarm handling procedure.

Figure 17-1 Alarm handling flowchart
  1. Obtain the name of the instance that is abnormal.

    1. Use a browser to log in to the FusionStage OM zone console.
      1. Log in to ManageOne Maintenance Portal.
        • Login address: https://<address for accessing the homepage of ManageOne Maintenance Portal>:31943, for example, https://oc.type.com:31943.
        • The default username is admin, and the default password is Huawei12#$.
      2. On the O&M Maps page, click the FusionStage link under Quick Links to go to the FusionStage OM zone console.
    2. Choose Application Operations > Application Operations from the main menu.
    3. In the navigation pane on the left, choose Alarm Center > Alarm List and query the alarm by setting query criteria.
    4. Click the alarm entry to expand its details. Record the values of name and namespace in Location Info; these are the podname and namespace used in later steps.

  2. Use PuTTY to log in to the manage_lb1_ip node.

    The default username is paas, and the default password is QAZ2wsx@123!.

  3. Run the following command and enter the password of the root user to switch to the root user:

    su - root

    Default password: QAZ2wsx@123!

  4. Determine the type of application and its health check mode.

    1. Run the following command to obtain the pod template:

      kubectl get pod podname -n namespace -oyaml

      In the preceding command, podname is the instance name obtained in 1, and namespace is the namespace obtained in 1.

      The content between the ellipses (...) in the command output is part of the pod template. If the spec field in the pod template contains the containers keyword, the application is containerized; otherwise, it is a process application.
      ...
      spec:
        containers:
        - image: */nginx:latest
      ...
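
      Alternatively (a hedged shortcut; jsonpath output formatting may vary with the kubectl version), list the container images directly:

      kubectl get pod podname -n namespace -o jsonpath='{.spec.containers[*].image}'
      # Non-empty output (one image per container) means the application is containerized.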
    2. If the pod template contains the following information, the health check mode of the application is set to Liveness. If the number of health check failures exceeds the threshold, the container or process is deleted and then re-created.
      ...
      livenessProbe:
            failureThreshold: 3
            httpGet:
              path: /healthz
              port: 10253
              scheme: HTTP
            initialDelaySeconds: 15
      ...
    3. If the pod template contains the following information, the health check mode of the application is set to Readiness. No operation is executed if the number of health check failures exceeds the threshold.
      ...
      readinessProbe:
            failureThreshold: 3
            httpGet:
              path: /healthz
              port: 10253
              scheme: HTTP
            initialDelaySeconds: 15
      ...
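
    As another hedged shortcut, the probe configuration can be extracted without reading the full template:

      kubectl get pod podname -n namespace -o jsonpath='{.spec.containers[*].livenessProbe}'
      kubectl get pod podname -n namespace -o jsonpath='{.spec.containers[*].readinessProbe}'
      # Empty output means no probe of that type is configured.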

  5. Run the following command to obtain the IP address of the node on which the pod runs:

    kubectl get pod podname -n namespace -oyaml | grep -i hostip:

    Log in to the node as the paas user.
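
    Alternatively (a hedged shortcut; in clusters where nodes are named by their IP addresses, the NODE column gives the address directly):

    kubectl get pod podname -n namespace -o wide
    # The NODE column shows where the pod is scheduled; the IP column is the pod IP.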

  6. Check whether the application process is abnormal.

    Run the ps -elf | grep appname command to check whether the process exists and whether it is suspended or has become a zombie process.

    appname in the command indicates the name of an application related to the abnormal pod. If the following information is displayed, the health check fails because the Nginx process is suspended, as indicated by the T state (a state reference follows this step):
    hostname:~ # ps -elf | grep nginx
    4 T root      31032  31016  0  80   0 -  8104 signal Jan11 ?        00:00:00 nginx: master process nginx -g daemon off;
    • If yes, go to 7.
    • If no, go to 8.
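
    For reference (a hedged example; appname is the application name used above), the process state can also be listed explicitly:

    ps -eo pid,stat,cmd | grep appname | grep -v grep
    # In the STAT column, "T" marks a stopped (suspended) process and "Z" a
    # zombie (defunct) process; "S" and "R" are normal sleeping/running states.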

  7. Rectify the abnormal application process.

    1. Restart the application process if it does not exist. If the application process is suspended, run the kill -18 processid command to resume the process (a command sketch follows this step).
    2. If the application process is in the zombie (Z) state, run the kill -9 parentid command to stop its parent process. If the parent process is a system process, reboot the node.
    3. Check whether the alarm is cleared after the application process is restarted successfully.
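
    The following is a hedged sketch of these recovery commands; processid and parentid come from the ps output in the previous step:

    kill -18 processid        # signal 18 is SIGCONT on mainstream Linux; it resumes a suspended process
    ps -o ppid= -p processid  # look up the parent PID of a zombie process
    kill -9 parentid          # stopping the parent lets the system reap the zombie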

  8. Check whether the health check configuration is correct.

    1. Go to the /var/paas/sys/log/kubernetes/ directory and search for error information in the kubelet.log file based on the instance name.
    2. Check the health check configuration according to the error information. If the following information is displayed, the health check script does not exist. In this case, correct the health check configuration for the application and check whether the alarm is cleared.
      hostname:~ # cd /var/paas/sys/log/kubernetes/
      hostname:/var/paas/sys/log/kubernetes # vi kubelet.log 
      I0113 16:53:03.756007   70092 prober.go:110] Liveness probe for "hello-component-3ab4e19a-2777348854-mxt6h_default(1caf348a-f83f-11e7-aa58-286ed488d1d4):hello-package" errored(exit status 127): bash: /tmp/check.sh: No such file or directory
      I0113 16:53:03.756022   70092 worker.go:226] prober error: exit status 127
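
      To narrow the search (an illustrative command; replace podname with the actual instance name):

      cd /var/paas/sys/log/kubernetes/
      grep -i podname kubelet.log | grep -iE 'probe|errored'
      # Lines containing "probe ... errored" show why the check failed, for
      # example the missing health check script above (exit status 127).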

  9. If a containerized application is running in the pod, rectify the fault if the Docker service is abnormal. (In normal cases, the Docker service is monitored by Monit; if the Docker process does not exist or is suspended, Monit restarts the Docker service.)

    1. Run the monit summary docker command to check whether Monit is monitoring Docker properly. In the following output, Status ok indicates that Docker is running properly:
      Service Name      Status        Type
      docker            Status ok     Program
      If Not monitored is displayed, run the monit monitor docker command to have Monit monitor the Docker process, and wait until Monit starts Docker.
    2. Run the ps -elf | grep /bin/docker command to check whether the Docker daemon is a zombie process. If it is, Monit cannot start Docker; the Z state marks a zombie process, as shown in the following, and you need to reboot the node:
      4 Z root      69301      1  0  80   0 - 0 -     Jan11 ?        00:00:00 [/usr/bin/dockerd] <defunct>
    3. Check whether the alarm is cleared after the Docker daemon process recovers. If yes, no further action is required.
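
    As an additional hedged check, the Docker daemon's responsiveness can be probed directly:

      timeout 10 docker info > /dev/null && echo "docker responsive" || echo "docker unresponsive"
      # If the daemon is suspended, docker info hangs; the timeout makes the
      # failure visible instead of blocking the shell.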

  10. Check whether the alarm indicating abnormal status of the instance is frequently reported and cleared.

    1. Log in to the tenant portal, choose Application Operations > Application Operations > Alarm Center > Alarm List. On the displayed page, search for historical alarms by the alarm name and location information (instance name).
    2. Check whether the alarm indicating abnormal status of the instance is continuously reported and cleared within a short period of time. If yes, go to 11. If no, go to 13.

  11. Check whether the CPU and memory loads are too high.

    Run the top command on the node to check the CPU and memory usage of the node. If the loads are too high, go to 12. If not, go to 13.
    top - 17:41:59 up 114 days,  8:20,  2 users,  load average: 0.17, 0.40, 0.39
    Tasks: 383 total,   1 running, 379 sleeping,   0 stopped,   3 zombie
    %Cpu(s): 10.0 us,  7.4 sy,  0.0 ni, 82.5 id,  0.1 wa,  0.0 hi,  0.0 si,  0.0 st
    KiB Mem:   7476636 total,  5585736 used,  1890900 free,   481516 buffers
    KiB Swap: 19920888 total,     5824 used, 19915064 free.  3270320 cached Mem
       PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                       
     61926 root      20   0 3324828  32476   9680 S 91.63 0.434   0:00.35 java
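
    To identify the heaviest consumers directly (a hedged follow-up to the top output):

    ps -eo pid,pcpu,pmem,comm --sort=-pcpu | head -n 10   # top CPU consumers
    ps -eo pid,pcpu,pmem,comm --sort=-pmem | head -n 10   # top memory consumers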

  12. Delete redundant applications and resources on the node, and then check whether the alarm is cleared. If yes, no further action is required. If no, go to 13.
  13. Contact technical support for assistance.

Alarm Clearing

This alarm will be automatically cleared after the fault is rectified.

Related Information

None
