FusionInsight HD 6.5.0 Product Description 02

MapReduce

JobHistoryServer HA

JobHistoryServer (JHS) is the server used to view historical MapReduce task information. Currently, the open source JHS supports only single-instance services. JHS HA solves the problem that, when a single point of failure (SPOF) occurs on the JHS, an application cannot access the MapReduce interface and therefore fails to execute. This greatly improves the availability of the MapReduce service.

Figure 4-21 Status transition of the JobHistoryServer HA active/standby switchover

JobHistoryServer High Availability

  • ZooKeeper is used to implement active/standby election and switchover.
  • JobHistoryServer uses the floating IP address to provide services externally.
  • Both the JHS single-instance and HA deployment modes are supported.
  • Only one node runs the JHS process at any given time, which prevents multiple JHS instances from processing the same file.
  • You can perform scale-out, scale-in, instance migration, upgrade, and health check.
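The active/standby election described in the list above can be illustrated with a toy model. This is only a sketch: `ToyCoordinator` is a hypothetical in-memory stand-in for a ZooKeeper lock node, not the real ZooKeeper API, and the instance names `jhs-1` and `jhs-2` are made up for the example.

```python
class ToyCoordinator:
    """In-memory stand-in for a ZooKeeper lock node (hypothetical, not the real API)."""

    def __init__(self):
        self.active = None  # name of the instance currently holding the lock

    def try_acquire(self, name):
        # The first caller wins the lock; every other caller stays standby.
        if self.active is None:
            self.active = name
        return self.active == name

    def release(self, name):
        # Models the lock disappearing when the active instance fails.
        if self.active == name:
            self.active = None


def elect(coordinator, candidates):
    """Return a mapping of instance name -> 'active' or 'standby'."""
    return {
        name: "active" if coordinator.try_acquire(name) else "standby"
        for name in candidates
    }


zk = ToyCoordinator()
before = elect(zk, ["jhs-1", "jhs-2"])  # jhs-1 becomes active
zk.release("jhs-1")                     # simulate failure of the active instance
after = elect(zk, ["jhs-2"])            # jhs-2 takes over
print(before, after)
```

At every point in the run, at most one instance holds the lock, which mirrors the rule that only one node runs the JHS process at a time.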

Improving MapReduce Performance by Optimizing its Merge/Sort Process in Specific Scenarios

Overview

The following figures show the MapReduce job execution process.

Figure 4-22 MapReduce job
Figure 4-23 MapReduce job execution process

The reduce phase consists of three steps: Copy, Sort (more accurately, Merge), and Reduce. In the Copy step, the Reducer fetches the output of the Maps from the NodeManagers and stores it in memory or on disk. The shuffle (Sort & Merge) step then begins: all fetched map output is sorted, and segments from different map outputs are merged before being passed to the reduce function. If a job has a large amount of map output to process, the shuffle process is time-consuming. For some tasks (for example, SQL tasks such as hash join and hash aggregation), sorting is not required during the shuffle process, yet it is performed by default.

This feature enhances the MapReduce API so that it can automatically disable sorting for such tasks. When sorting is disabled, the API directly merges the fetched map output and sends it to the reduce function. This eliminates the time spent on unnecessary sorting and significantly improves the efficiency of SQL tasks.
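To see why sorting is optional for hash aggregation, the following minimal sketch compares a sorted merge with a plain concatenation of map outputs. It is an illustration in plain Python, not the MapReduce API; the function names and sample data are assumptions made for this example.

```python
import heapq
from collections import defaultdict
from itertools import chain

# Two map outputs as (key, value) pairs, each already sorted by key.
map_outputs = [
    [("a", 1), ("c", 3)],
    [("a", 2), ("b", 5)],
]


def merge_sorted(segments):
    # Default behaviour: k-way merge that keeps keys in global order.
    return list(heapq.merge(*segments))


def merge_unsorted(segments):
    # Sorting disabled: simply concatenate the fetched segments.
    return list(chain.from_iterable(segments))


def hash_aggregate(pairs):
    # Hash aggregation (SUM per key) does not depend on input order.
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


sorted_result = hash_aggregate(merge_sorted(map_outputs))
unsorted_result = hash_aggregate(merge_unsorted(map_outputs))
print(sorted_result == unsorted_result)
```

Both paths produce the same per-key totals, so for this kind of task the ordering work done by the sorted merge contributes nothing to the result.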

Small Log File Problem Solved After Optimization of MR History Server

After a job running on YARN finishes, NodeManager uses LogAggregationService to collect the generated logs, send them to HDFS, and delete them from the local file system. Once the logs are stored on HDFS, they are managed by MR HistoryServer. LogAggregationService merges the local logs generated by containers into a single log file and uploads this file to HDFS, reducing the number of log files. However, in a large-scale, busy cluster, an excessive number of log files still accumulates on HDFS after long-term running.

For example, if there are 20 nodes, about 18 million log files are generated within the default clean-up period (15 days). These files occupy about 18 GB of NameNode memory and slow down HDFS responses.
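The figures in this example can be broken down with simple arithmetic. The sketch below uses only the numbers quoted above and assumes decimal gigabytes (1 GB = 10^9 bytes); the variable names are chosen for illustration.

```python
nodes = 20
cleanup_days = 15
log_files = 18_000_000
namenode_bytes = 18 * 10**9  # 18 GB, assuming decimal gigabytes

# Average NameNode memory consumed per log file.
bytes_per_file = namenode_bytes / log_files

# Average number of log files produced by one node in one day.
files_per_node_per_day = log_files / (nodes * cleanup_days)

print(bytes_per_file, files_per_node_per_day)
```

Under these assumptions, each log file costs roughly 1 KB of NameNode memory, so the memory footprint scales with the file count rather than with the volume of log data.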

Files stored on HDFS only need to be read and deleted. Therefore, the Hadoop Archives function can be used to periodically archive the log file directory.

Archiving Logs

The AggregatedLogArchiveService module is added to MR HistoryServer to periodically check the number of files in the log directory. When the number of files reaches the threshold, AggregatedLogArchiveService starts an archiving task to archive the log files. After archiving, it deletes the original log files to reduce the number of log files on HDFS.
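The check-then-archive behaviour described above might be modelled as follows. This is a simplified in-memory sketch: the dictionary layout, the function name `archive_if_needed`, and the threshold value are assumptions for illustration and do not reflect the actual AggregatedLogArchiveService implementation.

```python
def archive_if_needed(log_dir, threshold):
    """Archive all files in log_dir once their count reaches the threshold.

    log_dir is a dict with 'files' (list of file names) and
    'archives' (list of archive packages, each a list of file names).
    Returns True if an archiving task ran.
    """
    if len(log_dir["files"]) < threshold:
        return False
    log_dir["archives"].append(list(log_dir["files"]))  # pack into one archive
    log_dir["files"].clear()  # delete the originals after archiving
    return True


log_dir = {"files": ["app1.log", "app2.log", "app3.log"], "archives": []}
ran_early = archive_if_needed(log_dir, threshold=5)  # below threshold, no-op
ran_late = archive_if_needed(log_dir, threshold=3)   # threshold reached
print(ran_early, ran_late, log_dir)
```

After the second call, three separate entries have been replaced by a single archive package, which is the reduction in file count that the archiving task is meant to achieve.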

Cleaning Archived Logs

Hadoop Archives does not support deleting individual files inside an archive. Therefore, the entire archive log package must be deleted upon log clean-up. The AggregatedLogDeletionService module is modified to obtain the generation time of the latest log in the package. If all log files in the package meet the clean-up requirements, the archive log package is deleted.
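Because the package can only be removed as a whole, checking the newest log is enough to decide whether every log in it has expired. A minimal sketch of that rule, assuming timestamps in seconds and the 15-day default clean-up period mentioned earlier (the function name is hypothetical):

```python
DAY = 24 * 60 * 60
RETENTION = 15 * DAY  # default clean-up period from the example above


def archive_expired(log_mtimes, now, retention=RETENTION):
    """True only if the newest log in the package is past the retention period."""
    return now - max(log_mtimes) >= retention


now = 100 * DAY
old_package = [80 * DAY, 84 * DAY]    # newest log is 16 days old
mixed_package = [80 * DAY, 90 * DAY]  # newest log is only 10 days old

print(archive_expired(old_package, now), archive_expired(mixed_package, now))
```

The second package is kept even though one of its logs is 20 days old, because deleting it would also remove a log that is still within the retention period.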

Browsing Archived Logs

Hadoop Archives allows URI-based access to file content in the archive log package. Therefore, if MR HistoryServer detects during file access that the original log file no longer exists, it redirects the URI to the archive log package to access the archived log file.
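The fallback described above could look roughly like this. Hadoop Archives addresses files with URIs of the form `har:///archivepath/fileinarchive`; everything else in the sketch (the function name, the example paths, and the assumption that a log keeps its file name inside the archive) is hypothetical.

```python
def resolve_log_uri(log_path, existing_paths, archive_path):
    """Return the original path if it still exists, else a har:// URI."""
    if log_path in existing_paths:
        return log_path
    file_name = log_path.rsplit("/", 1)[-1]
    # Assumed layout: the log keeps its file name inside the archive.
    return "har://" + archive_path + "/" + file_name


existing = {"/tmp/logs/alice/app_2/node1_8041"}
archive = "/tmp/logs/alice/logs.har"  # hypothetical archive location

direct = resolve_log_uri("/tmp/logs/alice/app_2/node1_8041", existing, archive)
redirected = resolve_log_uri("/tmp/logs/alice/app_1/node1_8041", existing, archive)
print(direct)
print(redirected)
```

The caller does not need to know whether a log has been archived; the same request resolves to either the original file or its archived copy.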

NOTE:
  • This function invokes Hadoop Archives of HDFS for log archiving. Hadoop Archives executes an archiving task by running an MR application. Therefore, each archiving task adds an MR execution record.
  • Logs archived by this function are collected by the log aggregation function. Therefore, this function is valid only when the log aggregation function is enabled.
Updated: 2019-05-17

Document ID: EDOC1100074548
