FusionInsight HD V100R002C60SPC200 Product Description 06


Data Block Colocation

In offline data statistics scenarios, Join is a commonly used computation. In MapReduce, it is implemented as follows:

  1. Map tasks convert the records of the two tables into Join key and value pairs, hash-partition them by Join key, and send the data to different Reduce tasks for processing.
  2. Reduce tasks read the left table row by row in nested-loop mode and, for each row, scan every row of the right table. If the Join key values are identical, the Join result is output.

The biggest problem with this method is that the Join operation performs poorly, because the data stored on different nodes must be transferred over the network from Map tasks to Reduce tasks. The following figure shows this process.

Figure 4-5 Data transmission in the non-colocation scenario
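
The two steps above can be sketched in a few lines of Python. This is only an illustrative model, not MapReduce code: the function name reduce_side_join, the record layout (key, value), and the shuffled counter are assumptions introduced here to make the network cost visible.

```python
from collections import defaultdict


def reduce_side_join(left, right, num_reducers):
    """Join two lists of (key, value) records the reduce-side way.

    Returns the joined rows and the number of records that had to be
    sent from Map tasks to Reduce tasks.
    """
    # Map phase: tag each record with its table and hash-partition by Join key.
    partitions = defaultdict(lambda: {"L": [], "R": []})
    shuffled = 0
    for tag, table in (("L", left), ("R", right)):
        for key, value in table:
            partitions[hash(key) % num_reducers][tag].append((key, value))
            shuffled += 1  # every record crosses the Map -> Reduce boundary

    # Reduce phase: nested loop over the left and right rows of each partition.
    results = []
    for part in partitions.values():
        for lkey, lval in part["L"]:
            for rkey, rval in part["R"]:
                if lkey == rkey:
                    results.append((lkey, lval, rval))
    return sorted(results), shuffled


if __name__ == "__main__":
    users = [(1, "alice"), (2, "bob"), (3, "carol")]
    orders = [(1, "order-17"), (3, "order-42"), (3, "order-43")]
    rows, shuffled = reduce_side_join(users, orders, num_reducers=2)
    print(rows)
    print("records shuffled:", shuffled)
```

In this model every input record is shuffled once, so the transfer volume grows with the size of both tables regardless of how many rows actually match.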

Tables are stored in the physical file system as HDFS blocks. Therefore, if the data is partitioned by Join key and the two blocks to be joined are placed on the same machine, the Join results can be obtained directly by a Map Join on the local node, with no data transfer to a Reduce phase. This greatly improves performance.

With this feature, the same distribution ID can be specified for FileA and FileB, which need association and summary computing, so that all their blocks are placed at the same location. Computing can then be performed without cross-node data reads, which greatly improves MapReduce Join performance.

Figure 4-6 Data block distribution in colocation and non-colocation scenarios
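
The effect of a shared distribution ID can be modeled as follows. This is a rough sketch under stated assumptions, not the product's placement algorithm or API: the node list NODES, the CRC32-based node choice, and the function names place_blocks and colocated_pairs are all hypothetical.

```python
import zlib

NODES = ["node1", "node2", "node3", "node4"]  # hypothetical cluster


def place_blocks(file_name, num_blocks, distribution_id=None, replicas=2):
    """Return a list with one replica-node list per block of the file.

    With a distribution ID, every block is placed on the node set derived
    from that ID. Without one, each block is placed independently (modeled
    here by hashing the block name).
    """
    placement = []
    for i in range(num_blocks):
        seed = distribution_id if distribution_id is not None else f"{file_name}#blk{i}"
        start = zlib.crc32(seed.encode()) % len(NODES)
        placement.append([NODES[(start + r) % len(NODES)] for r in range(replicas)])
    return placement


def colocated_pairs(placement_a, placement_b):
    """Count block pairs (A_i, B_i) that have a replica on a common node."""
    return sum(1 for a, b in zip(placement_a, placement_b) if set(a) & set(b))


if __name__ == "__main__":
    # Same distribution ID: every block pair can be joined on a local node.
    a = place_blocks("FileA", 4, distribution_id="join-group-1")
    b = place_blocks("FileB", 4, distribution_id="join-group-1")
    print("colocated:", colocated_pairs(a, b), "of 4")

    # No distribution ID: some pairs may share no node and need network reads.
    a = place_blocks("FileA", 4)
    b = place_blocks("FileB", 4)
    print("not colocated:", colocated_pairs(a, b), "of 4")
```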

Damaged Hard Disk Volume Configuration

In the open-source release, if multiple data storage volumes are configured for a DataNode, the DataNode stops providing services by default when any one volume is damaged. If the configuration item dfs.datanode.failed.volumes.tolerated is set to specify the number of damaged volumes that can be tolerated, the DataNode continues to provide services as long as the number of damaged volumes does not exceed that threshold.

The value of dfs.datanode.failed.volumes.tolerated must be greater than or equal to 0. The default value is 0, as shown in Figure 4-7.

Figure 4-7 Item being set to 0

For example, three data storage volumes are mounted to a DataNode, and dfs.datanode.failed.volumes.tolerated is set to 1. In this case, if one data storage volume of DataNode cannot be used, this DataNode can still provide services, as shown in Figure 4-8.

Figure 4-8 Item being set to 1
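
The native rule can be written as a small predicate. The sketch below only models the behavior described above; the function name datanode_in_service and its parameters are assumptions made for illustration and do not correspond to any Hadoop API.

```python
def datanode_in_service(total_volumes, failed_volumes, tolerated=0):
    """Model of the native dfs.datanode.failed.volumes.tolerated rule."""
    if tolerated < 0:
        raise ValueError("the native setting must be greater than or equal to 0")
    if not 0 <= failed_volumes <= total_volumes:
        raise ValueError("failed_volumes must be between 0 and total_volumes")
    # The node serves while the failures stay within the threshold.
    # A node with no working volume left cannot serve data in any case.
    return failed_volumes <= tolerated and failed_volumes < total_volumes


if __name__ == "__main__":
    # Default (0): a single damaged volume stops the DataNode.
    print(datanode_in_service(total_volumes=3, failed_volumes=1, tolerated=0))
    # Set to 1: one damaged volume out of three is tolerated.
    print(datanode_in_service(total_volumes=3, failed_volumes=1, tolerated=1))
```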

This native configuration item has a drawback. When DataNodes have different numbers of data storage volumes, each DataNode must be configured independently; a single unified configuration file cannot be generated for all nodes.

For example, there are three DataNodes in the cluster. The first node has three data directories, the second node has four, and the third node has five. If you want to ensure that DataNode services are available when only one data directory is available, you need to perform the configuration as shown in Figure 4-9.

Figure 4-9 Attribute configuration before being strengthened

In the FusionInsight version of HDFS, this configuration item is enhanced with an additional value, -1. When the item is set to -1, a DataNode can provide services as long as at least one of its data storage volumes is available.

To resolve the problem in the preceding example, set this configuration to -1, as shown in Figure 4-10.

Figure 4-10 Attribute configuration after being strengthened
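
The difference between the two approaches can be checked with a short model of the three-node example. This is a sketch under assumptions: the node names dn1 to dn3 and the function names are hypothetical, and the per-node values are derived by arithmetic (number of volumes minus one) rather than read from the figures.

```python
def required_native_setting(total_volumes):
    """Native value needed so a node keeps serving with one volume left."""
    return total_volumes - 1


def in_service(total_volumes, failed_volumes, tolerated):
    """Model of the enhanced rule, including the added value -1."""
    if tolerated == -1:
        # Enhanced behavior: serve while at least one volume is available.
        return total_volumes - failed_volumes >= 1
    # Native behavior: serve while failures stay within the threshold.
    return failed_volumes <= tolerated and failed_volumes < total_volumes


if __name__ == "__main__":
    volumes_per_node = {"dn1": 3, "dn2": 4, "dn3": 5}  # hypothetical node names

    # Native approach: a different value has to be configured on every node.
    per_node = {n: required_native_setting(v) for n, v in volumes_per_node.items()}
    print("per-node settings:", per_node)

    # Enhanced approach: the single value -1 works for all nodes.
    for node, total in volumes_per_node.items():
        print(node, in_service(total, failed_volumes=total - 1, tolerated=-1))
```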

HDFS Startup Acceleration

In HDFS, the NameNode must load the metadata file, fsimage, during startup and then wait until the DataNodes start and report their data blocks. When the percentage of data blocks reported by the DataNodes reaches the specified threshold, the NameNode exits safe mode and startup is complete. If the number of files stored on HDFS reaches the millions or billions, both steps are time-consuming and lead to a long NameNode startup time. This version optimizes the process of loading the metadata file fsimage.

In the open-source HDFS, the fsimage file stores all types of metadata. Each type of metadata (such as file metadata and folder metadata) is stored independently in its own section, and sections are loaded serially. If a large number of files and folders are stored on HDFS, loading the sections is time-consuming and leads to a long HDFS startup time. In the Huawei HDFS, when the NameNode generates the fsimage file, it divides each type of metadata into segments and stores the data in multiple sections. When the NameNode starts, these sections are loaded in parallel. This greatly reduces the NameNode startup time and therefore significantly accelerates HDFS startup.
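
The idea of splitting one type of metadata into several sections and loading them in parallel can be sketched as follows. This is a structural illustration only, not the fsimage format or the NameNode loader: the record layout (inode_id, path), the section size, and all function names are assumptions, and Python threads are used simply to show the control flow, not to demonstrate a speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_sections(entries, section_size):
    """Divide one type of metadata into fixed-size segments (sections)."""
    return [entries[i:i + section_size] for i in range(0, len(entries), section_size)]


def load_section(section):
    """Stand-in for parsing one section into in-memory metadata."""
    return {inode_id: path for inode_id, path in section}


def load_serial(sections):
    """Load the sections one after another."""
    namespace = {}
    for section in sections:
        namespace.update(load_section(section))
    return namespace


def load_parallel(sections, workers=4):
    """Load the sections concurrently and merge the partial results."""
    namespace = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(load_section, sections):
            namespace.update(partial)
    return namespace


if __name__ == "__main__":
    entries = [(i, f"/data/file{i}") for i in range(1000)]
    sections = split_into_sections(entries, section_size=128)
    assert load_parallel(sections) == load_serial(sections)
    print(len(sections), "sections loaded")
```

Because the segments are disjoint, the partial results can be merged in any order and the loaded namespace is the same as with serial loading.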

Tag-based Block Placement Policies

Users can control where HDFS data blocks are placed according to data characteristics. Each HDFS directory maps to one node tag expression, and each DataNode maps to one or more tags. The tag-based block placement policy first determines, from the node tag expression that applies to a file, the range of DataNodes that can store files under the specified directory. The file is then stored within that range according to the next specified block placement policy, as shown in Figure 4-11.

  • The data in /HBase is stored in A, B, and D.
  • The data in /Spark is stored in A, B, D, E, and F.
  • The data in /user is stored in C, D, and F.
  • The data in /user/shl is stored in A, E, and F.

Figure 4-11 Example of tag-based block placement policy
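
The first step, selecting the DataNode range from a tag expression, can be modeled as follows. The tags T1 to T3, their assignment to nodes A to F, and the directory expressions are hypothetical values chosen here so that the result matches the list above; they are not taken from the figure. The expression syntax (tags joined by ||) and the function name candidate_nodes are likewise assumptions.

```python
# Hypothetical tag assignment: one or more tags per DataNode.
NODE_TAGS = {
    "A": {"T1", "T3"},
    "B": {"T1"},
    "C": {"T2"},
    "D": {"T1", "T2"},
    "E": {"T3"},
    "F": {"T2", "T3"},
}

# Hypothetical node tag expression per directory ("||" means "or").
DIR_EXPRESSIONS = {
    "/HBase": "T1",
    "/Spark": "T1 || T3",
    "/user": "T2",
    "/user/shl": "T3",
}


def candidate_nodes(directory):
    """Return the DataNodes whose tags satisfy the directory's expression."""
    wanted = {tag.strip() for tag in DIR_EXPRESSIONS[directory].split("||")}
    return sorted(node for node, tags in NODE_TAGS.items() if tags & wanted)


if __name__ == "__main__":
    for directory in DIR_EXPRESSIONS:
        print(directory, "->", candidate_nodes(directory))
```

The next block placement policy would then choose the actual replica locations from within the returned node range.
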
Updated: 2019-04-10

Document ID: EDOC1000104139
