FusionInsight HD V100R002C60SPC200 Product Description 06

MapReduce

Container Reuse

Native MapReduce breaks a job down into multiple tasks for execution. In MapReduce, the unit that executes a task is called a container. A container represents a unit of computing capacity; physically, it is a dynamically started JVM process.

By default, a container is terminated when its task completes, and ApplicationMaster then requests new containers based on resource requests and initializes the next tasks in them. With Container Reuse, a container automatically obtains a new task after finishing the previous one. This avoids re-allocating and re-initializing containers, eliminates the cost of starting and recycling them, and improves job execution efficiency. As a result, the computing performance of the MapReduce cluster improves significantly.

Container reuse applies only to the same type of tasks, for example, between map tasks or between reduce tasks. Reuse does not apply to different types of tasks, for example, between map and reduce tasks.
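The effect of reuse on the number of container launches can be illustrated with a small model. This is a sketch only, not FusionInsight code: the function name `container_starts` and the `slots` parameter are hypothetical, and the model assumes that all tasks of one type are served by at most `slots` concurrent containers.

```python
from collections import Counter


def container_starts(task_types, slots, reuse):
    """Count container (JVM) launches needed to run all tasks.

    task_types: list of task types, e.g. ["map", "map", "reduce"].
    slots:      assumed maximum number of concurrent containers per type.
    reuse:      True if a finished container picks up the next task
                of the same type instead of being terminated.
    """
    if not reuse:
        # Every task gets a freshly allocated and initialized container.
        return len(task_types)
    # With reuse, containers are shared only among tasks of the same type,
    # so map containers are never handed reduce tasks (and vice versa).
    per_type = Counter(task_types)
    return sum(min(slots, count) for count in per_type.values())


tasks = ["map"] * 100 + ["reduce"] * 10
without_reuse = container_starts(tasks, slots=8, reuse=False)
with_reuse = container_starts(tasks, slots=8, reuse=True)
```

In this model, 110 tasks need 110 launches without reuse but only 16 with reuse (8 map containers plus 8 reduce containers).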

The feature is provided in community Hadoop 1.0 but not in Hadoop 2.0. The DataSight implementation differs from that of community Hadoop 1.0: through parameter settings, DataSight can better meet MapReduce localization requirements.

Figure 4-13 Container Reuse

Improving MapReduce Performance by Optimizing its Merge/Sort Process in Specific Scenarios

Overview

The following figures show the MapReduce job execution process.

Figure 4-14 MapReduce job
Figure 4-15 MapReduce job execution process

The reduce phase has three steps: Copy, Sort (which is really a merge), and Reduce. In the Copy step, the Reducer fetches the map outputs from the NodeManagers and stores them either in memory or on disk. The shuffle (Sort & Merge) step then begins: all fetched map outputs are sorted, and segments from different map outputs are merged before being passed to the reducer function. If a job has to process a large amount of map output data, the shuffle process is time-consuming. For certain tasks (for example, SQL tasks such as hash join and hash aggregation), sorting is not required during the shuffle process, yet it is performed by default.

This feature enhances the MapReduce API so that the Sort process can be disabled automatically for such tasks. When sorting is disabled, the API directly merges the fetched map output data and sends it to the reducer function. This saves the time otherwise spent on sorting and therefore significantly improves the efficiency of SQL tasks.
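Why hash aggregation does not depend on sorted input can be shown with a toy example. The sketch below is not the MapReduce API: all function names are hypothetical, map outputs are modeled as in-memory lists of (key, value) pairs, and `shuffle_with_sort` stands in for the default sort-and-merge path.

```python
import heapq
from collections import defaultdict


def shuffle_with_sort(map_outputs):
    """Default path: sort each segment by key, then merge the sorted segments."""
    segments = [sorted(seg, key=lambda kv: kv[0]) for seg in map_outputs]
    return list(heapq.merge(*segments, key=lambda kv: kv[0]))


def shuffle_without_sort(map_outputs):
    """Sort disabled: the fetched segments are simply concatenated."""
    return [kv for seg in map_outputs for kv in seg]


def hash_aggregate(records):
    """Sum values per key with a hash table; input order is irrelevant."""
    totals = defaultdict(int)
    for key, value in records:
        totals[key] += value
    return dict(totals)


map_outputs = [[("b", 1), ("a", 2)], [("a", 3), ("c", 4)]]
sorted_result = hash_aggregate(shuffle_with_sort(map_outputs))
unsorted_result = hash_aggregate(shuffle_without_sort(map_outputs))
```

Both paths produce the same totals per key, so for this kind of reducer the sorting work is pure overhead.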

Small Log File Problem Solved After Optimization of MR History Server

After a job running on YARN completes, NodeManager uses LogAggregationService to collect the generated logs, send them to HDFS, and delete them from the local file system. Once stored in HDFS, the logs are managed by MR HistoryServer. LogAggregationService merges the local logs generated by containers into one log file and uploads that file to HDFS, which reduces the number of log files. Even so, in a large, busy cluster, an excessive number of log files accumulates on HDFS over time.

For example, in a cluster with 20 nodes, about 18 million log files are generated within the default clean-up period (15 days). These files occupy about 18 GB of NameNode memory and slow down HDFS responses.

Once stored in HDFS, these log files only need to be read and deleted. Therefore, the Hadoop Archives function can be used to periodically archive the log file directory.
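The figures in this example can be broken down with simple arithmetic. The per-node rate and per-file memory cost below are derived from the numbers above rather than stated in the source, and "18 GB" is read as decimal gigabytes, which is an assumption.

```python
nodes = 20
cleanup_period_days = 15
log_files = 18_000_000
namenode_memory_bytes = 18 * 10**9  # assumption: 1 GB = 10**9 bytes

# Average number of log files each node contributes per day.
files_per_node_per_day = log_files / (nodes * cleanup_period_days)

# Average NameNode memory consumed per log file.
bytes_per_file = namenode_memory_bytes / log_files
```

Under these assumptions, each node contributes about 60,000 log files per day, and each file costs roughly 1 KB of NameNode memory.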

Archiving Logs

The AggregatedLogArchiveService module is added to MR HistoryServer to periodically check the number of files in the log directory. When the number of files reaches the threshold, AggregatedLogArchiveService starts an archiving task to archive the log files. After archiving, it deletes the original log files to reduce the number of log files on HDFS.
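The periodic check can be sketched as follows. This is an illustration, not the AggregatedLogArchiveService implementation: the function name `run_archive_check` is hypothetical, the log directory is modeled as an in-memory set of file names, and each archive is modeled as a frozen set of the files it contains.

```python
def run_archive_check(log_dir, archives, threshold):
    """Run one periodic check of the log directory.

    log_dir:   set of log file names currently in the directory (mutated).
    archives:  list of archives created so far (mutated).
    threshold: file count at which archiving is triggered.
    Returns True if an archiving task ran, False otherwise.
    """
    if len(log_dir) < threshold:
        return False
    # Pack all current log files into one archive ...
    archives.append(frozenset(log_dir))
    # ... then delete the originals so fewer files remain in the directory.
    log_dir.clear()
    return True


log_dir = {"app_1.log", "app_2.log", "app_3.log"}
archives = []
ran_below_threshold = run_archive_check(log_dir, archives, threshold=5)
ran_at_threshold = run_archive_check(log_dir, archives, threshold=3)
```

After the second call, the three original files are replaced by a single archive entry.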

Cleaning Archived Logs

Hadoop Archives does not support deleting individual files from an archive. Therefore, the entire archive log package must be deleted during log clean-up. The AggregatedLogDeletionService module is modified to obtain the generation time of the latest log in the package. The archive log package is deleted only if all of its log files meet the clean-up requirements.
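The clean-up rule reduces to checking the newest log in a package: if the newest log is old enough, every log in the package is. The sketch below assumes timestamps in seconds and treats a log as expired once its age reaches the retention period; the function name and the exact comparison are assumptions, not taken from AggregatedLogDeletionService.

```python
DAY = 24 * 60 * 60  # seconds per day


def archive_can_be_deleted(log_times, now, retention_seconds):
    """Return True only if every log in the archive package has expired.

    log_times: generation times (seconds) of the logs in one archive package.
    Checking the latest generation time is enough, because all other
    logs in the package are at least as old.
    """
    return now - max(log_times) >= retention_seconds


now = 100 * DAY
all_expired = archive_can_be_deleted([now - 20 * DAY, now - 16 * DAY], now, 15 * DAY)
one_recent = archive_can_be_deleted([now - 20 * DAY, now - 1 * DAY], now, 15 * DAY)
```

A package that still contains one recent log is kept in full, which is the cost of not being able to delete files inside an archive.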

Browsing Archived Logs

Hadoop Archives allows URI-based access to file content in the archive log package. Therefore, if MR History Server detects that the original log file does not exist during file access, it directly redirects the URI to the archive log package to access the archived log file.
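The redirect can be sketched as a path-resolution step. Hadoop Archives exposes archived files through `har://` URIs; everything else here is hypothetical, including the function name, the example paths, and the assumption that the archive was created with the log root as its parent directory, so that paths inside the archive are relative to that root.

```python
import posixpath


def resolve_log_uri(log_path, existing_files, log_root, archive_path):
    """Return the location from which a log file should be read.

    log_path:       original HDFS path of the log file.
    existing_files: set of paths that still exist as plain files.
    log_root:       directory that was archived (assumed archive parent).
    archive_path:   HDFS path of the .har archive log package.
    """
    if log_path in existing_files:
        return log_path
    # The original file is gone, so point at the same relative path
    # inside the archive log package instead.
    relative = posixpath.relpath(log_path, log_root)
    return "har://" + posixpath.join(archive_path, relative)


uri = resolve_log_uri(
    "/tmp/logs/app_1/node1",
    existing_files=set(),
    log_root="/tmp/logs",
    archive_path="/tmp/logs-archive/logs.har",
)
```

Here the missing file resolves to `har:///tmp/logs-archive/logs.har/app_1/node1`, while a file that still exists is returned unchanged.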

NOTE:
  1. This function invokes Hadoop Archives of HDFS for log archiving. Hadoop Archives executes an archiving task by running an MR application. Therefore, each archiving task adds an MR execution record.
  2. Logs archived by this function are collected by the log aggregation function. Therefore, this function is valid only when the log aggregation function is enabled.

Updated: 2019-04-10

Document ID: EDOC1000104139
