FusionCloud 6.3.1.1 Troubleshooting Guide 02

Single Abnormal ETCD Pod

Symptom

Fault Symptom

A single etcd pod is abnormal.

Fault Locating
  • Checking the status of etcd pods on the OM zone
    1. Use PuTTY to log in to the om_core1_ip node.

      The default username is paas, and the default password is QAZ2wsx@123!.

    2. Run the following command to query the status of etcd pods on the OM zone:

      kubectl get pod -nom -owide|grep etcd|grep -v cse

      etcd-backup-server-paas-10-118-16-189        1/1       Running    0          3d        10.118.16.189   paas-10-118-16-189
      etcd-backup-server-paas-10-118-16-231        1/1       Running    0          37m       10.118.16.231   paas-10-118-16-231
      etcd-backup-server-paas-10-118-16-53         1/1       Running    0          2h        10.118.16.53    paas-10-118-16-53
      etcd-event-server-paas-10-118-16-189         1/1       Running    0          3d        10.118.16.189   paas-10-118-16-189
      etcd-event-server-paas-10-118-16-231         1/1       Running    0          37m       10.118.16.231   paas-10-118-16-231
      etcd-event-server-paas-10-118-16-53          1/1       Running    0          2h        10.118.16.53    paas-10-118-16-53
      etcd-network-server-paas-10-118-16-189       1/1       Running    0          3d        10.118.16.189   paas-10-118-16-189
      etcd-network-server-paas-10-118-16-231       1/1       Running    0          37m       10.118.16.231   paas-10-118-16-231
      etcd-network-server-paas-10-118-16-53        1/1       Running    0          2h        10.118.16.53    paas-10-118-16-53
      etcd-server-paas-10-118-16-189               1/1       Running    0          3d        10.118.16.189   paas-10-118-16-189
      etcd-server-paas-10-118-16-231               1/1       Running    3          37m       10.118.16.231   paas-10-118-16-231
      etcd-server-paas-10-118-16-53                1/1       Running    0          2h        10.118.16.53    paas-10-118-16-53
  • Checking the status of etcd pods on the tenant management zone
    1. Use PuTTY to log in to the om_core1_ip node.

      The default username is paas, and the default password is QAZ2wsx@123!.

    2. Run the following command to query the status of etcd pods on the tenant management zone:
      kubectl get pod -n manage -owide|grep etcd|grep -v cse |grep -v elb
      etcd-0             1/1       Running   2          31m       10.184.42.123    paas-manage-core5-7d03ba8e-823f-gklq7
      etcd-1             1/1       Running   2          31m       10.184.41.116    paas-manage-core4-7d03ba8e-823f-cp30f
      etcd-2             1/1       Running   2          31m       10.177.119.50    paas-manage-core3-7d03ba8e-823f-f3j5m
      etcd-event-0       1/1       Err       2          31m       172.16.9.165     paas-manage-core5-7d03ba8e-823f-gklq7
      etcd-event-1       1/1       Running   2          31m       172.16.23.120    paas-manage-core3-7d03ba8e-823f-f3j5m
      etcd-event-2       1/1       Running   2          31m       172.16.24.176    paas-manage-core4-7d03ba8e-823f-cp30f
      etcd-network-0     1/1       Running   2          31m       172.16.23.119    paas-manage-core3-7d03ba8e-823f-f3j5m
      etcd-network-1     1/1       Running   2          31m       172.16.9.166     paas-manage-core5-7d03ba8e-823f-gklq7
      etcd-network-2      1/1       Running   2          31m       172.16.24.175    paas-manage-core4-7d03ba8e-823f-cp30f        
      NOTE:

      The pods whose status is not Running are abnormal.
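The filtering in the checks above can be sketched as a small helper that flags any pod whose STATUS column is not Running. This is a minimal sketch: abnormal_pods is a hypothetical helper name, and the sample listing is abbreviated from the output above.

```shell
# Print the names of pods whose STATUS column (field 3 of
# `kubectl get pod -owide` output) is not "Running".
# Reads the listing on stdin so it can be tested against captured output.
abnormal_pods() {
  awk '$3 != "Running" {print $1}'
}

# Abbreviated listing from the tenant management zone above:
sample='etcd-0             1/1       Running   2          31m
etcd-event-0       1/1       Err       2          31m
etcd-event-1       1/1       Running   2          31m'

printf '%s\n' "$sample" | abnormal_pods
# prints: etcd-event-0
```

On a live node, the `kubectl get pod ... | grep etcd` output from the steps above could be piped into it directly (the grep also strips the header line, which would otherwise be misread as a pod).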

Troubleshooting

Prerequisites

The paas user has been added to the whitelist. For details, see Operations When the sudo Command Failed to Be Run.

Locating the Root Cause of a Fault
  1. Use PuTTY to log in to the om_core1_ip node.

    The default username is paas, and the default password is QAZ2wsx@123!.

  2. Run the following command to query the IP address of the node where etcd-0 resides:

    kubectl get pod etcd-0 -nmanage -oyaml | grep hostIP

    hostIP: 10.154.248.63
    NOTE:

    In the preceding command, etcd-0 is the name of the abnormal pod obtained in section Symptom.

  3. Log in as the paas user to the node where etcd-0 resides.
  4. Perform the following operations to locate the root cause of the etcd fault and rectify the fault accordingly:

    • Container network problems
      1. Run the following command to log in to the etcd-0 container:

        sudo docker ps |grep etcd-0

        0a1c9946060f        10.184.42.33:20202/root/cfe-etcd:2.2.4                   "/bin/sh -c 'umask 06"   3 minutes ago       Up 3 minutes                            k8s_etcd.902abe6d_etcd-0_manage_4072c181-888c-11e7-9423-286ed489be96_29e8d83b
        08131be7509a        10.184.42.33:20202/root/default/cfe-pause:2.8.7          "/pause"                 2 hours ago         Up 2 hours                              k8s_POD.2cdee072_etcd-0_manage_4072c181-888c-11e7-9423-286ed489be96_f5970a1f

        The container ID is displayed in the command output. In this case, the container ID is 0a1c9946060f.

        sudo docker exec -it 0a1c9946060f sh

      2. Run the following command to check whether the network connections between etcd-0, etcd-1, and etcd-2 are normal:

        ping etcd-1.etcd.manage

        4.2$ ping etcd-1.etcd.manage
        PING etcd-1.etcd.manage.svc.cluster.local (10.184.41.116) 56(84) bytes of data.
        64 bytes from etcd-network-0.etcd-network.manage.svc.cluster.local (10.184.41.116): icmp_seq=1 ttl=63 time=1.53 ms
        64 bytes from etcd-network-0.etcd-network.manage.svc.cluster.local (10.184.41.116): icmp_seq=2 ttl=63 time=1.97 ms

        If the preceding information is not displayed, contact technical support to rectify container network problems.

    • Disk space problems

      Run the following commands to check disk space:

      cd /var/paas/run

      df -h . | grep 100%

      If any information is displayed, the disk space is used up. Clear the disk space.
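The disk-space test above amounts to grepping `df` output for a 100% usage figure. A minimal sketch, with disk_full as a hypothetical helper name; it reads captured `df` output on stdin so it can be exercised without an actually full disk:

```shell
# Succeed (exit 0) when the `df -h` output piped in reports a
# filesystem at 100% usage, mirroring `df -h . | grep 100%` above.
disk_full() {
  grep -q '100%'
}

# On a live node: df -h /var/paas/run | disk_full && echo "clear disk space"

# Exercised against captured output (hypothetical sample):
full='Filesystem  Size  Used Avail Use% Mounted on
/dev/xvde    50G   50G     0 100% /var/paas'
printf '%s\n' "$full" | disk_full && echo "disk space exhausted"
# prints: disk space exhausted
```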

    • Disk I/O problems

      Run the following command to check system I/O status:

      iostat -x 1

      Linux 3.12.49-11-default (SZV1000269249)  04/23/17  _x86_64_  (16 CPU)
        
      avg-cpu:  %user   %nice %system %iowait  %steal   %idle 
                  8.37    0.04   13.11    4.63    0.00   73.85 
        
      Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util 
      xvda              0.27    83.42    1.12   28.17    22.88   551.59    39.22     0.26    8.99    5.95    9.11   0.52   100 
      xvde              0.75   219.17   12.87  355.79   276.25  8284.19    46.44     1.04    2.83    4.69    2.76   0.35  99 
      dm-0              0.00     0.00    0.07    0.77     0.27     3.08     8.00     0.00    4.06    1.45    4.29   0.94   0.08 
      dm-1              0.00     0.00   11.79   15.72   265.92   251.92    37.64     0.25    8.91    5.05   11.80   0.46   1.25 
      dm-2              0.00     0.00   11.79   15.72   265.92   251.92    37.64     0.25    8.91    5.05   11.81   0.46   100 
      xvdf              0.00   240.41    4.03  610.73   169.29  4264.36    14.42     1.71    2.78    4.61    2.76   0.97 100

      In the command output, if the %util column repeatedly shows values at or near 100 (as with the 100 and 99 above), system I/O is saturated. In this case, contact IaaS technical support for system optimization.
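The %util reading above can be automated: the last column of each device row in `iostat -x` output is %util, so an awk filter can list the saturated devices. A minimal sketch; saturated_devices is a hypothetical helper name and the sample is abbreviated from the output above:

```shell
# Print devices whose %util (last column of `iostat -x` output) is at
# or above a threshold (default 99). Reads captured output on stdin.
saturated_devices() {
  awk -v limit="${1:-99}" '
    /^Device:/ { in_table = 1; next }   # device table starts after this header
    in_table && NF > 1 && $NF + 0 >= limit { print $1 }
  '
}

# Abbreviated from the iostat output above:
sample='Device:  rrqm/s  wrqm/s  r/s   w/s    rkB/s  wkB/s   avgrq-sz avgqu-sz await r_await w_await svctm %util
xvda     0.27    83.42   1.12  28.17  22.88  551.59  39.22    0.26     8.99  5.95    9.11    0.52  100
dm-0     0.00    0.00    0.07  0.77   0.27   3.08    8.00     0.00     4.06  1.45    4.29    0.94  0.08'

printf '%s\n' "$sample" | saturated_devices 99
# prints: xvda
```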

    • If the root cause is not located, or the fault is not solved, see Handling etcd Faults in the OM Zone and Handling etcd Faults in the Tenant Management Zone.

Handling etcd Faults in the OM Zone

This section uses the etcd fault on the OM-Core03 node as an example.

  1. Use PuTTY to log in to the om_core3_ip node.

    The default username is paas, and the default password is QAZ2wsx@123!.

  2. Remove the manifest file.

    NOTE:

    This example restores etcd-event. When restoring etcd or etcd-network, replace etcd-event in the commands with etcd or etcd-network respectively.

    cd /var/paas/kubernetes/manifests/

    mv etcd-event.manifest ..

  3. Move the data directory to a temporary directory.

    The etcd data files are stored in the etcd-event/, etcd/, and etcd-network/ subdirectories of /var/paas/run. Move the directory of the faulty cluster (etcd-event/ in this example) to a temporary directory:

    cd /var/paas/run

    mkdir ../tmp

    mv etcd-event/ ../tmp

  4. Use PuTTY to log in to the om_core1_ip node.

    The default username is paas, and the default password is QAZ2wsx@123!.

  5. Run the following command to log in to the etcd-event container:

    sudo docker ps | grep etcd-event

    Information similar to the following is returned:

    6d774ac2ac2e      cfe-etcd:2.8.7                                            "/bin/sh -c 'umask 06"   2 days ago          Up 2 days                               k8s_etcd-container.d6f90091_etcd-event-server-10.184.42.132_om_9f4b2d62d846556015bb495930f7fa4f_6a546c2e
    b577e0f5e45a      paas-cfe-pause-bootstrap                                  "/pause"                 2 days ago          Up 2 days                               k8s_POD.6d5cdc5e_etcd-event-server-10.184.42.132_om_9f4b2d62d846556015bb495930f7fa4f_561795ae

    Record the container ID. In this case, the container ID is 6d774ac2ac2e.

    sudo docker exec -it 6d774ac2ac2e bash

  6. Check the IP address and port of each node in the etcd cluster.

    NOTE:

    By default, 4001 is the client port of the etcd cluster, 4002 is the client port of the etcd-event cluster, and 4003 is the client port of the etcd-network cluster.

    ETCDCTL_API=3 /start-etcd --cacert /srv/kubernetes/ca.cer --cert /srv/kubernetes/server.cer --key /srv/kubernetes/server_key.pem --endpoints https://127.0.0.1:4002 member list

    Information similar to the following is returned:

    1f4397f9956e1e8b, started, infra1, https://10.184.43.79:2381, https://10.184.43.79:4002
    9a3dd24ebfc5c212, started, infra2, https://10.177.119.155:2381, https://10.177.119.155:4002
    fc4a4cd2cf50cbf1, started, infra0, https://10.184.42.132:2381, https://10.184.42.132:4002

    Record the IP address and port of each node. In this case, the IP addresses and ports are https://10.184.43.79:4002, https://10.177.119.155:4002, and https://10.184.42.132:4002.
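The member list fields are ID, status, name, peer URL, and client URL, so the comma-separated --endpoints value used in the following steps can be assembled directly from that output. A minimal sketch; client_endpoints is a hypothetical helper name:

```shell
# Build a comma-separated --endpoints value from `member list` output
# (fields: ID, status, name, peer URL, client URL).
client_endpoints() {
  awk -F', ' '{print $5}' | paste -sd, -
}

# Member list from the step above:
members='1f4397f9956e1e8b, started, infra1, https://10.184.43.79:2381, https://10.184.43.79:4002
9a3dd24ebfc5c212, started, infra2, https://10.177.119.155:2381, https://10.177.119.155:4002
fc4a4cd2cf50cbf1, started, infra0, https://10.184.42.132:2381, https://10.184.42.132:4002'

printf '%s\n' "$members" | client_endpoints
# prints: https://10.184.43.79:4002,https://10.177.119.155:4002,https://10.184.42.132:4002
```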

  7. Query the status of the etcd cluster:

    NOTE:

    By default, 4001 is the client port of the etcd cluster, 4002 is the client port of the etcd-event cluster, and 4003 is the client port of the etcd-network cluster.

    ETCDCTL_API=3 /start-etcd --cacert /srv/kubernetes/ca.cer --cert /srv/kubernetes/server.cer --key /srv/kubernetes/server_key.pem --endpoints https://10.184.43.79:4002,https://10.177.119.155:4002,https://10.184.42.132:4002 endpoint status

    2017-08-18 20:14:32.663688 I | warning: ignoring ServerName for user-provided CA for backwards compatibility is deprecated
    Failed to get the status of endpoint https://10.177.119.155:4002 (context deadline exceeded)
    https://10.184.42.132:4002, fc4a4cd2cf50cbf1, 3.1.9, 8.2 MB, false, 17, 2617441
    https://10.184.43.79:4002, 1f4397f9956e1e8b, 3.1.9, 8.4 MB, true, 17, 2617441

    The status of https://10.177.119.155:4002 is abnormal. Find the corresponding node ID in the member list obtained in Step 6. In this case, the node ID is 9a3dd24ebfc5c212.
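Mapping the failed endpoint back to the member ID that must be removed is a lookup on the fifth (client URL) field of the member list. A minimal sketch; member_id_for is a hypothetical helper name:

```shell
# Print the member ID whose client URL (field 5 of `member list`
# output) matches the endpoint that failed `endpoint status`.
member_id_for() {
  awk -F', ' -v url="$1" '$5 == url { print $1 }'
}

# Member list from Step 6 above:
members='1f4397f9956e1e8b, started, infra1, https://10.184.43.79:2381, https://10.184.43.79:4002
9a3dd24ebfc5c212, started, infra2, https://10.177.119.155:2381, https://10.177.119.155:4002
fc4a4cd2cf50cbf1, started, infra0, https://10.184.42.132:2381, https://10.184.42.132:4002'

printf '%s\n' "$members" | member_id_for https://10.177.119.155:4002
# prints: 9a3dd24ebfc5c212
```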

  8. Delete the node whose status cannot be queried.

    NOTE:

    By default, 4001 is the client port of the etcd cluster, 4002 is the client port of the etcd-event cluster, and 4003 is the client port of the etcd-network cluster.

    The ID of the node to be deleted was obtained in Step 7.

    ETCDCTL_API=3 /start-etcd --cacert /srv/kubernetes/ca.cer --cert /srv/kubernetes/server.cer --key /srv/kubernetes/server_key.pem --endpoints https://10.184.42.132:4002,https://10.184.43.79:4002,https://10.177.119.155:4002 member remove 9a3dd24ebfc5c212

    Information similar to the following is returned:

    2017-08-18 20:20:08.659346 I | warning: ignoring ServerName for user-provided CA for backwards compatibility is deprecated
    Member 9a3dd24ebfc5c212 removed from cluster b2d484e5f23f7a6e

  9. Use PuTTY to log in to the om_core3_ip node.

    The default username is paas, and the default password is QAZ2wsx@123!.

  10. Move the etcd-event.manifest file back to the manifests directory on the OM-Core03 node.

    cd /var/paas/kubernetes/manifests/

    mv ../etcd-event.manifest .

  11. Perform Checking the status of etcd pods on the OM zone in Fault Locating to verify that the pod is in the Running state. If it is, the fault has been rectified.
  12. Run the following commands to delete the temporary directory:

    cd /var/paas

    rm -rf tmp/

Handling etcd Faults in the Tenant Management Zone

This section uses a faulty pod in etcd-0 as an example.

  1. Use PuTTY to log in to the om_core1_ip node.

    The default username is paas, and the default password is QAZ2wsx@123!.

  2. Run the following command to query the IP address of the node where etcd-0 resides:

    kubectl get pod etcd-0 -nmanage -oyaml | grep hostIP

     hostIP: 10.154.248.63

    In the preceding command, etcd-0 indicates the name of the abnormal pod.

  3. Log in as the paas user to the node where etcd-0 resides.
  4. Stop the abnormal etcd-0 container.

    NOTE:

    This example restores etcd-0. When restoring etcd-event or etcd-network, replace etcd-0 in the command with the name of the abnormal pod (for example, etcd-event-0 or etcd-network-0).

    sudo docker ps | grep etcd-0 | awk '{print $1}' | xargs sudo docker kill

  5. Move the data directory to a temporary directory.

    The etcd data files are stored in the etcd/, etcd-event/, and etcd-network/ subdirectories of /var/paas/run. Move the directory of the faulty cluster (etcd/ in this example) to a temporary directory:

    cd /var/paas/run

    mkdir ../tmp

    mv etcd/ ../tmp

  6. Perform Checking the status of etcd pods on the tenant management zone in Fault Locating to verify that the pod is in the Running state. If it is, the fault has been rectified.
  7. Run the following commands to delete the temporary directory:

    cd /var/paas

    rm -rf tmp/

Updated: 2019-06-10

Document ID: EDOC1100063248