No relevant resource is found in the selected language.

This site uses cookies. By continuing to browse the site you are agreeing to our use of cookies. Read our privacy policy>Search

Reminder

To have a better experience, please upgrade your IE browser.

upgrade

FusionInsight HD 6.5.0 Product Description 02

Rate and give feedback :
Huawei uses machine translation combined with human proofreading to translate this document to different languages in order to help you better understand the content of this document. Note: Even the most advanced machine translation cannot match the quality of professional translators. Huawei shall not bear any responsibility for translation accuracy and it is recommended that you refer to the English document (a link for which has been provided).
Hive

Hive

Basic Principle

Function

Hive is an open-source data warehouse built on Hadoop. It provides batch computing capability for the big data platform and is able to batch analyze and summarize structured and semi-structured data for data calculation. In addition, Hive provides storage of structured data and basic data analysis services using the Hive query language (HQL), a language like the structured query language (SQL). Hive converts HQL statements into MapReduce tasks for querying and analyzing massive data stored in Hadoop clusters.

Hive provides the following functions:

  • Analyzes massive structured data and summarizes analysis results.
  • Allows complex MapReduce jobs to be compiled in SQL languages.
  • Supports data storage formats, including JavaScript object notation (JSON), comma separated values (CSV), TextFile, RCFile, SequenceFile, and Optimized Row Columnar (ORC).
Structure

Hive is a single-instance service process that provides services by translating HQL statements into related MapReduce jobs or HDFS operations. Figure 2-41 shows the Hive structure.

Figure 2-41 Hive structure
Table 2-9 Module description

Concept

Description

HiveServer

Multiple HiveServers can be deployed in a cluster in load sharing mode. HiveServer provides Hive database services externally, translates HQL statements submitted by users into related Yarn tasks, or HDFS operations to complete data extraction, conversion, and analysis.

MetaStore

  • Multiple MetaStores can be deployed in a cluster in load sharing mode. MetaStore provides Hive metadata services as well as reads, writes, maintains, and modifies the structure and attributes of Hive tables.
  • MetaStore provides Thrift interfaces for HiveServer, Spark, WebHCat, and other MetaStore clients to access and operate metadata.

WebHCat

Multiple WebHCats can be deployed in a cluster in load sharing mode. WebHCat provides Rest interfaces and uses them to run the Hive command to submit MapReduce tasks.

Hive client

Includes the human-machine command-line interface (CLI) Beeline, JDBC driver provided for JDBC applications, Python driver provided for Python applications, and HCatalog JAR packages provided for MapReduce.

ZooKeeper cluster

As a temporary node, ZooKeeper records the IP address list of each HiveServer instance. The client driver connects to ZooKeeper to obtain the list and select the HiveServer instance according to the routing mechanism.

HDFS/HBase cluster

Hive table data is stored in the HDFS cluster.

MapReduce/Yarn cluster

Provides distributed computing services. Most Hive data operations rely on MapReduce. The main function of HiveServer is to translate HQL statements into MapReduce jobs to process massive data.

HCatalog is a table and storage management layer on Hadoop. Users who use different data processing tools (such as MapReduce) can read and write data more easily in the cluster by using HCatalog. As shown in Figure 2-42, development personnel's application programs use HTTP requests to access Hadoop MapReduce (Yarn), Pig, Hive, and HCatalog DDL. If the request is an HCatalog DDL command, it will be executed directly. If the request is a MapReduce, Pig, or Hive task, it will be placed in the queue of the WebHCat (Templeton) server, allowing the progress to be monitored or stopped. Development personnel specify the paths to the processing results of MapReduce, Pig, and Hive tasks in the HDFS.

Figure 2-42 WebHCat logical architecture
Principle

Hive functions as a data warehouse based on HDFS and MapReduce architecture and translates HQL statements into MapReduce jobs or HDFS operations.

Figure 2-43 shows the Hive structure.

  • Metastore - reads, write, and update metadata such as tables, columns, and partitions. Its lower layer is relational databases.
  • Driver - manages the life cycle of HQL execution and participates in the entire Hive job execution.
  • Compiler - translates HQL statements into a series of interdependent MapReduce jobs.
  • Optimizer - is classified into logical optimizer and physical optimizer to optimize HQL execution plans and MapReduce jobs, respectively.
  • Executor - executes Map and Reduce jobs based on job dependencies.
  • ThriftServer - provides thrift interfaces as the servers of JDBC and ODBC, and integrates Hive and other applications.
  • Clients - includes Web UI and JDBC/ODBC interfaces, and provides interfaces for user access.
Figure 2-43 Hive effect

Working Principles of Hive CBO

Principles

CBO is short for Cost-Based Optimization.

The optimization objective is as follows:

During compilation, the CBO calculates the most efficient joining sequence based on tables and query conditions involved in query statements to reduce time and resources required for query.

In Hive, the CBO is implemented as follows:

Hive uses open source component Apache Calcite to implement the CBO. SQL statements are first converted into Hive ASTs and then into RelNodes that can be identified by Calcite. After Calcite adjusts the joining sequence in RelNodes, Hive converts RelNodes into ASTs to continue the logical and physical optimization of Hive. Figure 2-44 shows the implementation process.

Figure 2-44 Implementation process

Calcite adjusts the joining sequence as follows:

  1. A table is selected as the first table from the tables to be joined.
  2. The second and third tables are selected based on the cost. In this way, multiple different execution plans are obtained.
  3. A plan that costs the least is calculated and serves as the final sequence.

The cost calculation method is as follows:

In the current version, costs are measured based on the number of data entries after joining. Fewer data entries mean less cost. The number of joined data entries depends on the selection rate of joined tables. The number of data entries in a table is obtained based on the table-level statistics.

The number of data entries in a table after filtering is estimated based on the column-level statistics, max, min, and Number of Distinct Values (NDV).

For example, table table_a exists. The total number of data entries in the table is 1,000,000 and the NDV is 50.

Assume that the query condition is as follows:

Select * from table_a where colum_a='value1';

The estimated number of queried data entries is: 1000000 x 1/50 = 20000. The selection rate is 2%.

The following takes the TPC-DS Q3 as an example to describe how the CBO adjusts the joining sequence.

select 
    dt.d_year, 
    item.i_brand_id brand_id, 
    item.i_brand brand, 
    sum(ss_ext_sales_price) sum_agg 
from 
    date_dim dt, 
    store_sales, 
    item 
where 
    dt.d_date_sk = store_sales.ss_sold_date_sk 
    and store_sales.ss_item_sk = item.i_item_sk 
    and item.i_manufact_id = 436 
    and dt.d_moy = 12 
group by dt.d_year , item.i_brand , item.i_brand_id 
order by dt.d_year , sum_agg desc , brand_id 
limit 10;

Statement explanation: The statement indicates that inner joining is performed for three tables: fact table store_sales with about 2,900,000,000 data entries, dimension table date_dim with about 73,000 data entries, and dimension table item with about 18,000 data entries. Each table has filtering conditions. Figure 2-45 shows the joining relationship.

Figure 2-45 Joining relationship

The CBO must first select the tables that bring the best filtering effect for joining.

By analyzing min, max, NDV, and the number of data entries, the CBO estimates the selection rates of different dimension tables, as shown in Table 2-10.

Table 2-10 Data filtering

Table

Number of Original Data Entries

Number of Data Entries After Filtering

Selection Rate

date_dim

73000

6200

8.5%

item

18000

19

0.1%

The selection rate can be estimated as follows: Selection rate = Number of data entries after filtering/Number of original data entries

As shown in the preceding table, the item table has a better filtering effect. Therefore, the CBO joins the itemtable first before joining thedate_dim table.

Figure 2-46 shows the joining process when the CBO is disabled.

Figure 2-46 Joining process when the CBO is disabled

Figure 2-47 shows the joining process when the CBO is enabled.

Figure 2-47 Joining process when the CBO is enabled

After the CBO is enabled, the number of intermediate data entries is reduced from 495,000,000 to 2,900,000, and the execution time is remarkably reduced.

Relationship with Components

Relationship Between Hive and HDFS

Hive is the subproject of Apache Hadoop. Hive uses the Hadoop Distributed File System (HDFS) as the file storage system. Hive parses and processes structured data, and HDFS provides highly reliable lower-layer storage support for Hive. All data files in the Hive database are stored in HDFS, and all data operations on Hive are also performed using HDFS APIs.

Relationship Between Hive and MapReduce

Hive data computing depends on MapReduce. MapReduce is a subproject of the Apache Hadoop project. It is a parallel computing framework based on HDFS. During data analysis, Hive translates HQL statements submitted by users into MapReduce jobs and submits the jobs for MapReduce to execute.

Relationship Between Hive and DBService

MetaStore (metadata service) of Hive processes the structure and attribute information of Hive databases, tables, and partitions. The information needs to be stored in a relational database and is maintained and processed by MetaStore. In FusionInsight HD, the relational database is maintained by the DBService component.

Download
Updated: 2019-05-17

Document ID: EDOC1100074548

Views: 3095

Downloads: 36

Average rating:
This Document Applies to these Products
Related Documents
Related Version
Share
Previous Next