FusionInsight HD 6.5.0 Software Installation 02

Installing the Cluster

Failure Due to an Existing Node Agent During Cluster Installation

Symptom

During the cluster installation, an error message is displayed indicating that the node agent has already been installed.

Possible Causes

A node agent already exists on the node when the cluster installation starts.

Troubleshooting Method
  1. Check whether the node agent installation information exists in /var/log/Bigdata/controller/controller.log on the active management node.
  2. It takes about 1 minute to rectify the fault.
Procedure
  1. Log in to the active management node as user root using PuTTY.
  2. Check whether the node agent installation information exists in controller.log on the active management node. The path to the log file is /var/log/Bigdata/controller/controller.log. (See the grep example after this procedure.)

    cat /var/log/Bigdata/controller/controller.log

    If the node agent is already installed, the log file contains the following information:

    ...is found with existing node agent installation.

  3. Obtain the node where the node agent installation fails from the log error information or the error information displayed on the FusionInsight Manager portal.
  4. Log in to the node where the node agent installation fails as user root using PuTTY.
  5. Uninstall the node agent.

    Run the script ${BIGDATA_HOME}/om-agent/nodeagent/setup/uninstall.sh.

  6. On the FusionInsight Manager portal, click Retry to install the component again.
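For Step 2, a single grep command (a minimal sketch, assuming the log message wording shown above) lists the log entries that identify nodes with an existing node agent:

    grep "existing node agent installation" /var/log/Bigdata/controller/controller.log    # each matching line should name an affected node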
References

None

Cluster Installation Failure Because Some NodeAgents Cannot Be Installed

Symptom

During the cluster installation, an error message is displayed in the Setting Up New Nodes with JDK and Node Agent step.

Possible Causes

Some nodes cannot communicate with other nodes in the cluster.

Troubleshooting Methods
  • Check which nodes cannot communicate with others properly and restore the communication.
  • Check which nodes cannot communicate with others properly, make sure that the nodes can be deleted, delete the nodes, and reinstall the cluster. It takes about 2 minutes to rectify the fault.
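For example, basic connectivity from the active management node to a suspect node can be checked as follows (a minimal sketch; the IP address is a placeholder):

    ping -c 3 <suspect-node-ip>    # no reply indicates a network or host fault on that node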
Procedure
  1. In the row where the error is reported, click the error information to check which nodes cannot communicate properly with other nodes.

  2. Click Finish to return to the FusionInsight Manager.
  3. Choose Homepage > More > Uninstall to uninstall the cluster. Then reinstall the cluster and deselect the faulty node during host discovery.

    NOTE:

    After the node where the error is reported is deselected, the deployment instance topology of each service may change. Modify the topology according to the actual situation.

Failed to Start All Instances in the Starting Cluster Step of Cluster Installation

Symptom

During cluster installation, an error occurs indicating that all instances fail to start in the Starting Cluster step.

Possible Causes

The HDFS checking script runs for more than 60 minutes if a server has a low hardware configuration. In this case, FusionInsight Manager stops the process of the checking script, resulting in a cluster startup failure.

Procedure
  1. Click Retry on the installation WebUI to reinstall the cluster.
  2. After the cluster is successfully installed, click Finish.

Failed to Start Spark During the Cluster Installation

Symptom

During the cluster installation, the system displays a message indicating that the Spark service fails to be started.

Figure 7-3 Spark service fails to be started
Possible Causes

Other similar software was installed on the cluster nodes and its residual files were not completely deleted.

Troubleshooting Methods

Check whether the libhadoop.so file of another version exists in /usr/lib64/ of each node. If the file exists, delete it and restart the Spark service.

Procedure
  1. Use PuTTY to log in to each node as user root.
  2. Run the following commands to go to /usr/lib64/ and check whether the libhadoop.so file exists:

    cd /usr/lib64

    find libhadoop*

    If the following information is displayed, the file exists:

    libhadoop.so 
    libhadoop.so.1.0.0

  3. Run the following command to delete the file:

    rm -f libhadoop.so.1.0.0

    rm -f libhadoop.so

  4. Restart the Spark service. The fault is rectified.
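If the cluster has many nodes, the check in Step 2 can be run from a single host against every node; a sketch, assuming a hypothetical node list file /opt/node_list.txt and SSH access as user root:

    for host in $(cat /opt/node_list.txt); do
        echo "== $host =="
        ssh "$host" "ls /usr/lib64/libhadoop* 2>/dev/null"
    done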

DBService Startup Failure After Cluster Installation Due to Incorrect HA Parameter

Scenario

During the DBService or cluster installation, an error message is displayed indicating that the HA parameter of DBService is incorrect, and the installation fails.

Possible Causes
  • The dbservice.floatip parameter of DBService is incorrectly set.
  • The dbservice.mediator.ip parameter of DBService is incorrectly set.
Troubleshooting Methods
  1. If the dbservice.floatip and dbservice.mediator.ip parameters of DBService are incorrectly set, the error log on FusionInsight Manager will display a "Failed to config HA" message. Check whether these parameters are correctly set.
  2. Fault recovery takes about 15 minutes.
Procedure
  1. Log in to FusionInsight Manager and choose Cluster > Service > DBService > Configuration. On the displayed page, view the parameters of DBService.
  2. Check whether the dbservice.floatip parameter is correctly set. The value of dbservice.floatip must be an IP address that is unique on the network and not in use.

    1. Use PuTTY to log in to a node where DBService is located as user root.
    2. Run the ping dbservice.floatip command. If the floating IP address can be pinged, the IP address is in use.
    3. Use PuTTY to log in to the node that uses the floating IP address and run the ifconfig dbservice.floatip.interface down command to disable the floating IP address. Then, run the ping command again to check whether the floating IP address still exists.
    4. If the IP address can still be pinged, the floating IP address is configured for multiple nodes. In this case, change the IP address of DBService into a unique IP address on FusionInsight Manager.

  3. On the Linux OS, run the route command to check whether the parameter value of dbservice.mediator.ip is the same as the gateway address. If they are different, change the parameter value of dbservice.mediator.ip into the gateway address on FusionInsight Manager.
  4. Log in to FusionInsight Manager, click Cluster > Service, and restart DBService on the displayed page.
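A minimal sketch of the checks in Step 2 and Step 3 (the addresses are placeholders for the configured parameter values):

    ping -c 3 <dbservice.floatip>    # replies mean the floating IP address is already in use
    route -n | grep '^0.0.0.0'       # the gateway in this line should match dbservice.mediator.ip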

DBService Startup Failure During Cluster Installation Due to an Unavailable Port

Symptom

During the cluster installation or a DBService restart, DBService fails to start. An error message in the log indicates that port 20051 is occupied.

Possible Causes
  • The default port, port 20051, of DBService is occupied by other processes.
  • Stopping of the DBService process fails and therefore the port is not released.
Troubleshooting Methods
  1. Check whether an "already exists for port:20051" message is displayed in the error log on FusionInsight Manager.
  2. Fault recovery takes about 10 minutes.
Procedure

This section describes how to rectify the fault caused by the occupied port.

  1. Use PuTTY to log in to the host where the DBService installation failure occurred as user root and run the ps -ef | grep "20051" command.
  2. Run the kill command to forcibly stop the process that uses port 20051.
  3. Run the following command in the /tmp and /var/run/FusionInsight-DBService directories. Delete all queried files.

    find . -name "*20051*"

  4. Log in to FusionInsight Manager again, click Cluster > Service, and restart DBService.
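For Step 1, the process occupying the port can also be identified directly; a sketch (run as user root; lsof may not be installed on every OS):

    netstat -anp | grep 20051
    lsof -i :20051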

Memory Usage of dentry Exceeds the Threshold

Symptom

The cluster OS is Red Hat 6.4, CentOS 6.4, or later versions. After the cluster is installed, the slabtop command is used to view the memory usage on the active NameNode and active ResourceManager. It is found that the dentry memory usage keeps increasing and "ALM-12018 Memory Usage Exceeds the Threshold" is finally reported.

Possible Causes

The version of the nss-util and nss-softokn RPM packages in the OS is too early and the related environment variable is not imported.

Troubleshooting Methods

Contact the OS vendor to upgrade the nss-util and nss-softokn RPM packages to the specified version or later, and import the environment variable as user omm.

Procedure
  1. Use PuTTY to log in to any node as user root. Run the following command to check the version of the nss-softokn:

    rpm -qa | grep nss-softokn

    • If the version of the nss-softokn is earlier than 3.14.3-22, go to Step 2.
    • If the version of the nss-softokn is 3.14.3-22 or later, go to Step 3.

  2. You are advised to upgrade the nss-softokn to 3.14.3-22 or a later version. For details about the upgrade method, contact the OS service provider.
  3. Use PuTTY to log in to any node as user root. Run the following command to check the version of the nss-util:

    rpm -qa | grep nss-util

    • If the version of the nss-util is earlier than 3.16.2.3, go to Step 4.
    • If the version of the nss-util is 3.16.2.3 or later, go to Step 5.

  4. You are advised to upgrade the nss-util to 3.16.2.3 or a later version. For details about the upgrade method, contact the OS service provider.

    NOTE:

    The two RPM packages depend on other RPM packages of later versions. Upgrade the RPM packages they depend on to later versions as well.

    For example, upgrade nss-softokn to 3.14.3-22 and nss-util to 3.18.0. Obtain the following RPM packages:

    • nspr-4.10.8-1.el6_6.x86_64.rpm
    • nss-softokn-3.14.3-22.el6_6.x86_64.rpm
    • nss-softokn-freebl-3.14.3-22.el6_6.x86_64.rpm
    • nss-util-3.18.0-1.el6_6.x86_64.rpm

    Upload the preceding RPM packages to all nodes (for example, to the /opt/rpm directory) and run the following commands to install the RPM packages:

    cd /opt/rpm

    rpm -Uvh *.rpm

  5. Use PuTTY to log in to all nodes as user root. Run the following command to open the /home/omm/.profile file.

    vi /home/omm/.profile

  6. Press Insert to enter the edit mode and add the following content as the last line of the file. Then press Esc and enter :wq to save the modification and exit the vi editor.

    export NSS_SDB_USE_CACHE=no

  7. Run the reboot command to restart the node.
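After the node restarts, the setting can be verified and the dentry usage rechecked; a minimal sketch (run as user root, assuming /home/omm/.profile is sourced by the omm login shell):

    su - omm -c 'echo $NSS_SDB_USE_CACHE'    # expected output: no
    slabtop -o | grep dentry                 # the dentry usage should stop increasing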

Failed to Uninstall the Cluster and an Error Message Indicating the HDFS Service Uninstallation Failure Is Displayed

Symptom

The cluster cannot be uninstalled, and an error message "Failed to complete un-installation of cluster 'xxx'" is displayed.

Possible Causes

When the amount of data in HDFS is too large (larger than 300 TB), the uninstallation takes longer than 15 minutes and times out, and an error message indicating an HDFS uninstallation failure is displayed. Open the message in the error row, and check whether the following information is contained:

[2014-12-05 18:07:31] RoleInstance cleanup failed [{ScriptExecutionResult=ScriptExecutionResult [exitCode=143, output=, errMsg=]}] for DataNode#10.1.8.216@vm-216. 
[2014-12-05 18:07:31] RoleInstance uninstall failure for ROLE[name: DataNode]. 
[2014-12-05 18:07:31] Role uninstall failure for ServiceName: HDFS. 
[2014-12-05 18:07:31] Service uninstall failure for CLUSTER[name: cluster].
Procedure

This fault can be rectified in two ways:

  • Rectify the fault on the portal.
    1. Click Retry, and wait for the cluster to be uninstalled again.
    2. After the cluster is uninstalled, click Finish.
      NOTE:

      If the fault persists, repeat 1 to 2 until the cluster is successfully uninstalled.

  • Rectify the fault by deleting the HDFS data in the CLI and then uninstalling the cluster again.
    1. Use PuTTY to log in to the node where the HDFS client is located as user root. Log in to the HDFS client, and run the following commands. (/opt/hadoopclient indicates the client directory and path indicates the HDFS data storage directory. Both of the directories can be customized by users.)

      cd /opt/hadoopclient/HDFS/hadoop/bin

      ./hdfs dfs -rm -r -skipTrash <path>

    2. Perform uninstallation-related operations to uninstall the cluster again.
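Before uninstalling again, the amount of HDFS data to be deleted can be estimated on the client; a sketch using the same client directory and path placeholders as above (in a security-enabled cluster, authenticate with kinit first):

    cd /opt/hadoopclient/HDFS/hadoop/bin
    ./hdfs dfs -du -s -h <path>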

Failed to Start Redis During the Cluster Installation

Symptom

During the cluster installation, the system displays a message indicating that the Redis service fails to be started, as shown in Figure 7-4.

Figure 7-4 Redis start failure
Possible Causes

The number of Redis instances exceeds the maximum number of connections configured for the DBService database.

Troubleshooting Methods

Check whether the message "remaining connection slots are reserved for non-replication superuser connections" is displayed in the failure details.

Procedure
  1. Reset dbservice.database.max.connections to a value greater than the number of installed Redis instances.
  2. Save the configuration and restart DBService.
  3. Restart the Redis service. The fault is rectified.

Cluster Installation Fails and the System Reports Incorrect OS Type or Connection Failure

Scenario

Cluster installation fails. The system reports that the OS type is incorrect or the connection fails.

Possible Causes
  • The OS of the faulty node is different from that of the active management node (for example, SUSE and Red Hat are installed on the two nodes, respectively).
  • The SSL/TLS protocol version supported by the OS of the faulty node is earlier than that of the active management node. For the mapping between the OS and the SSL/TLS protocol version, see Table 7-8 in Configuration File Required for the Manager.
Troubleshooting Methods
  • Check whether the OS of the faulty node is the same as that of the active management node.
  • Check whether the SSL/TLS protocol version supported by the OS of the faulty node is earlier than that of the active management node.
Procedure
  1. Use PuTTY to log in as user root to the active management node and faulty node, respectively. After login, run the cat /etc/*-release command on the nodes to check whether their OSs are the same.

    • If the nodes have different OSs, reinstall the OS of the faulty node to be the same as that of the active management node, and ensure that the SSL/TLS protocol version supported by the new OS is the same as or later than that supported by the active management node.
    • If the OSs of the two nodes are the same, go to Step 2.

  2. Query the value of the tls_protocol_min parameter that matches the OS according to Table 7-8 in Configuration File Required for the Manager. In this example, the OS is SUSE 11.3, and the value of tls_protocol_min that matches SUSE 11.3 is sslv3.

  3. Use PuTTY to log in to the active management node as user omm. Run the following command to check the value of the tls_protocol_min parameter. If the value is greater than that queried in Step 2, reinstall the OS of the faulty node and ensure that the SSL/TLS protocol version supported by the new OS is the same as or later than that supported by the active management node.

    cat $CONTROLLER_HOME/etc/om/omsconfigmodle/initconfig/ldap.properties

    LDAP_SERVER_PORT=21750
    TLS_PROTOCOL_MIN=sslv3
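For the comparison in Step 1, both OS releases can also be checked from the active management node; a sketch, assuming SSH access to the faulty node (the IP address is a placeholder):

    cat /etc/*-release                           # on the active management node
    ssh <faulty-node-ip> "cat /etc/*-release"    # output from the faulty node for comparison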
