MapReduce Service (MRS) 3.3.1-LTS Usage Guide (for Huawei Cloud Stack 8.3.1) 05
Using Hive
- Using Hive from Scratch
- Configuring Hive Parameters
- Hive SQL
- Permission Management
- Using a Hive Client
- Using HDFS Colocation to Store Hive Tables
- Using the Hive Column Encryption Function
- Customizing Row Separators
- Deleting Single-Row Records from Hive on HBase
- Configuring HTTPS/HTTP-based REST APIs
- Enabling or Disabling the Transform Function
- Access Control of a Dynamic Table View on Hive
- Specifying Whether the ADMIN Permission Is Required for Creating Temporary Functions
- Using Hive to Read Data in a Relational Database
- Supporting Traditional Relational Database Syntax in Hive
- Creating User-Defined Hive Functions
- Enhancing beeline Reliability
- Viewing Table Structures Using the show create Statement as Users with the select Permission
- Writing a Directory into Hive with the Old Data Removed to the Recycle Bin
- Inserting Data to a Directory That Does Not Exist
- Creating Databases and Creating Tables in the Default Database Only as the Hive Administrator
- Disabling of Specifying the location Keyword When Creating an Internal Hive Table
- Enabling the Function of Creating a Foreign Table in a Directory That Can Only Be Read
- Authorizing Over 32 Roles in Hive
- Restricting the Maximum Number of Maps for Hive Tasks
- HiveServer Lease Isolation
- Hive Supports Isolation of MetaStore Instances Based on Components
- Switching the Hive Execution Engine to Tez
- Interconnecting Hive with External Self-Built Relational Databases
- Redis-based CacheStore of HiveMetaStore
- Hive Supporting Reading Hudi Tables
- Hive Supporting Cold and Hot Storage of Partitioned Metadata
- Hive Supporting ZSTD Compression Formats
- Hive Partitioned Tables Support OBS and HDFS Storage Sources
- Locating Abnormal Hive Files
- Using the ZSTD_JNI Compression Algorithm to Compress Hive ORC Tables
- HiveMetaStore Access Pressure Reduction
- Load Balancing for Hive MetaStore Client Connection
- Data Import and Export in Hive
- Enabling Hive Interception of Concurrent Overwrite
- Hive Log Overview
- Hive Performance Tuning
- Common Issues About Hive
- How Do I Delete UDFs on Multiple HiveServers at the Same Time?
- Why Cannot the DROP Operation Be Performed on a Backed-up Hive Table?
- How to Perform Operations on Local Files with Hive User-Defined Functions
- How Do I Forcibly Stop MapReduce Jobs Executed by Hive?
- Table Creation Fails Because Hive Complex Fields' Names Contain Special Characters
- How Do I Monitor the Hive Table Size?
- How Do I Prevent Key Directories from Data Loss Caused by Misoperations of the insert overwrite Statement?
- Why Is Hive on Spark Task Freezing When HBase Is Not Installed?
- Error Reported When the WHERE Condition Is Used to Query Tables with Excessive Partitions in FusionInsight Hive
- Why Cannot I Connect to HiveServer When I Use IBM JDK to Access the Beeline Client?
- Why Cannot Data Be Queried After the MapReduce Engine Is Switched After the Tez Engine Is Used to Execute Union-related Statements?
- Why Does Hive Not Support Vectorized Query?
- How Do I Disable the Logging Function of Hive?
- Why Do Metastore Instances Fail to Be Restarted After They Are Deleted and Added Again?
- Hive Configuration Problems
- How Do I Handle the Error Reported When Setting hive.exec.stagingdir on the Hive Client?
- Load Data Command Failed to Import a Specified Wildcard Matching File to a Dynamic Partitioned Table
- How Do I Optimize the INSERT OVERWRITE for Reading and Writing in the Same Table?
Using Hive from Scratch
Hive is a data warehouse framework built on Hadoop. It maps structured data files to database tables and provides SQL-like functions for analyzing and processing data. It also lets you quickly perform simple MapReduce statistics using SQL-like statements without developing a dedicated MapReduce application, making it suitable for statistical analysis of data warehouses.
Background
Suppose a user develops an application to manage users who use service A in an enterprise. The procedure of operating service A on the Hive client is as follows:
Operations on common tables:
- Create the user_info table.
- Add users' educational backgrounds and professional titles to the table.
- Query user names and addresses by user ID.
- Delete the user information table after service A ends.
Table 2-801 User information

ID | Name | Gender | Age | Address
---|---|---|---|---
12005000201 | A | Male | 19 | City A
12005000202 | B | Female | 23 | City B
12005000203 | C | Male | 26 | City C
12005000204 | D | Male | 18 | City D
12005000205 | E | Female | 21 | City E
12005000206 | F | Male | 32 | City F
12005000207 | G | Female | 29 | City G
12005000208 | H | Female | 30 | City H
12005000209 | I | Male | 26 | City I
12005000210 | J | Female | 25 | City J
Procedure
- Log in to the node where the client is installed as the client installation user and run the following command to switch to the client installation directory, for example, /opt/hadoopclient.
cd /opt/hadoopclient
- Run the following command to configure environment variables:
source bigdata_env
- If Kerberos authentication is enabled for the current cluster, run the following command to authenticate the current user. The current user must have the permission to create Hive tables. If Kerberos authentication is disabled for the current cluster, skip this step.
kinit MRS cluster user
For example, kinit hiveuser
- Run the Hive client command to implement service A.
- Operations on internal tables:
- Run the following command to log in to the Hive client CLI:
beeline
Hive allows you to add extension identifiers to JDBC connection strings. These extension identifiers are printed in HiveServer audit logs to distinguish SQL sources. You can concatenate the following to a URL:
auditAddition=xxx
xxx is the custom identifier. The identifier can contain a maximum of 256 bytes and only letters, digits, underscores (_), commas (,), and colons (:) are allowed.
For details about how to set the extension identifier using code, see the Hive Development Guide. The client connection can be set in either of the following ways:
- Modify the Client installation directory/Hive/component_env file, add \;auditAddition=xxx to the end of the CLIENT_HIVE_URI parameter, and run the source bigdata_env command again to apply the changes.
- When using a specified JDBC URL to connect to the Hive client, add ;auditAddition=xxx at the end of the URL. The following is an example:
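(The HiveServer address, port, and identifier below are illustrative placeholders only; other connection parameters required by your cluster, such as Kerberos settings, are omitted.)
beeline -u "jdbc:hive2://192.168.1.10:21066/;auditAddition=batch_job_01"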
- Create the user_info user information table according to Table 2-801 and add data to it.
create table user_info(id string,name string,gender string,age int,addr string);
insert into table user_info(id,name,gender,age,addr) values("12005000201","A","Male",19,"City A");
- Add users' educational backgrounds and professional titles to the user_info table.
For example, to add educational background and title information about user 12005000201, run the following command:
alter table user_info add columns(education string,technical string);
- Query user names and addresses by user ID.
For example, to query the name and address of user 12005000201, run the following command:
select name,addr from user_info where id='12005000201';
- Delete the user information table.
drop table user_info;
- Run the following command to exit:
!q
- Operations on external partitioned tables:
- Run the insert statement to insert data into an external table, or run the load data command to import file data from HDFS into an external table. To run the load data command, perform the following operations:
- Create a file based on the data in Table 2-801. For example, name the file txt.log. Separate fields with spaces and end each line with a newline character.
- Run the following command to upload the file to the HDFS directory, for example, /tmp:
hdfs dfs -put txt.log /tmp
- Create a path for storing external table data.
hdfs dfs -mkdir /hive/
hdfs dfs -mkdir /hive/user_info
- Run the following command to log in to the Hive client CLI:
beeline
Hive allows you to add extension identifiers to JDBC connection strings. These extension identifiers are printed in HiveServer audit logs to distinguish SQL sources. You can concatenate the following to a URL:
auditAddition=xxx
xxx is the custom identifier. The identifier can contain a maximum of 256 bytes and only letters, digits, underscores (_), commas (,), and colons (:) are allowed.
For details about how to set the extension identifier using code, see the Hive Development Guide. The client connection can be set in either of the following ways:
- Modify the Client installation directory/Hive/component_env file, add \;auditAddition=xxx to the end of the CLIENT_HIVE_URI parameter, and run the source bigdata_env command again to apply the changes.
- When using a specified JDBC URL to connect to the Hive client, add ;auditAddition=xxx at the end of the URL. The following is an example:
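(As in the internal-table case above, the HiveServer address and the identifier below are placeholders only; additional connection parameters required by your cluster are omitted.)
beeline -u "jdbc:hive2://192.168.1.10:21066/;auditAddition=external_table_load"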
- Run the following statement to create a table:
create external table user_info(id string,name string,gender string,age int,addr string) partitioned by(year string) row format delimited fields terminated by ' ' lines terminated by '\n' stored as textfile location '/hive/user_info';
fields terminated by indicates the field delimiter, for example, a space.
lines terminated by indicates the line break character, for example, \n.
/hive/user_info indicates the path of the data file.
- Import data.
- Run the insert statement to insert data. The following describes how to insert information about the user 12005000201. The procedure for inserting information about other users is similar.
insert into user_info partition(year="2018") values ("12005000201","A","Male",19,"City A");
- Run the load data command to import file data.
load data inpath '/tmp/txt.log' into table user_info partition (year='2011');
In the preceding command, /tmp/txt.log indicates the data file uploaded to the HDFS in 4.a.
- Query the imported data.
select * from user_info;
- Delete the user information table.
drop table user_info;
- Run the following command to exit:
!q
Configuring Hive Parameters
Navigation Path
Go to the Hive configurations page by referring to Modifying Cluster Service Configuration Parameters.
Parameter Description
Parameter | Description | Default Value
---|---|---
hive.auto.convert.join | Indicates whether Hive converts a common join to a mapjoin based on the input file size. NOTE: When Hive is used to query a join table, regardless of the table size (if the data in the join table is less than 24 MB, it is considered a small table), set this parameter to false. If this parameter is set to true, new mapjoins cannot be generated when you query a join table. | true or false. The default value is true.
hive.default.fileformat | Indicates the default file format used by Hive. | RCFile
hive.exec.reducers.max | Indicates the maximum number of reducers in a MapReduce job submitted by Hive. | 999
hive.server2.thrift.max.worker.threads | Indicates the maximum number of threads that can be started in the HiveServer internal thread pool. | 1,000
hive.server2.thrift.min.worker.threads | Indicates the number of threads started during initialization in the HiveServer internal thread pool. | 5
hive.hbase.delete.mode.enabled | Indicates whether to enable the function of deleting HBase records from Hive. If this function is enabled, you can run remove table <table_name> where <condition> to delete HBase records from Hive. | true
hive.metastore.server.min.threads | Indicates the minimum number of threads started by MetaStore for processing connections. Once the thread count reaches this value, MetaStore always keeps at least this many resident threads in its thread pool. | 200
hive.server2.enable.doAs | Indicates whether HiveServer2 impersonates the client user in sessions with other services (such as Yarn and HDFS). If you change this configuration item from false to true, users with only the column permission lose access to the corresponding tables. | true
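Cluster-level values are changed on Manager as described above. For quick testing, many of these Hive parameters can also be overridden for a single Beeline session with the set command before changing them cluster-wide; the values below are a sketch for illustration only, and server-side settings such as the hive.server2.* thread parameters still must be changed on Manager:
set hive.auto.convert.join=false;
set hive.exec.reducers.max=500;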
Hive SQL
Hive SQL supports all features in Hive-3.1.0. For details, visit https://cwiki.apache.org/confluence/display/hive/languagemanual.
Table 2-803 describes the extended Hive statements provided by MRS.
Table 2-803 Extended Hive statements

- Extended Syntax: CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name (col_name data_type [COMMENT col_comment], ...) [ROW FORMAT row_format] [STORED AS file_format] | STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...) ] ...... [TBLPROPERTIES ("groupId"=" group1 ","locatorId"="locator1")] ...;
  Syntax Description: The statement is used to create a Hive table and specify the locators on which the table data files are located. For details, see "Component Operation Guide" > "Using Hive" > "Using HDFS Colocation to Store Hive Tables" in the MapReduce Service (MRS) 3.3.1-LTS Usage Guide (for Huawei Cloud Stack 8.3.1).
  Syntax Example: CREATE TABLE tab1 (id INT, name STRING) row format delimited fields terminated by '\t' stored as RCFILE TBLPROPERTIES("groupId"=" group1 ","locatorId"="locator1");
  Example Description: The statement is used to create table tab1 and specify locator1, on which the table data of tab1 is located.
- Extended Syntax: CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name (col_name data_type [COMMENT col_comment], ...) [ROW FORMAT row_format] [STORED AS file_format] | STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...) ] ... [TBLPROPERTIES ('column.encode.columns'='col_name1,col_name2'| 'column.encode.indices'='col_id1,col_id2', 'column.encode.classname'='encode_classname')]...;
  Syntax Description: The statement is used to create a Hive table and specify the columns to be encrypted and the encryption algorithm. For details, see "Component Operation Guide" > "Using Hive" > "Using the Hive Column Encryption Function" in the MapReduce Service (MRS) 3.3.1-LTS Usage Guide (for Huawei Cloud Stack 8.3.1).
  Syntax Example: create table encode_test(id INT, name STRING, phone STRING, address STRING) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' WITH SERDEPROPERTIES ('column.encode.indices'='2,3', 'column.encode.classname'='org.apache.hadoop.hive.serde2.SMS4Rewriter') STORED AS TEXTFILE;
  Example Description: The statement is used to create table encode_test and specify that column 2 and column 3 will be encrypted using the org.apache.hadoop.hive.serde2.SMS4Rewriter encryption algorithm class during data insertion.
- Extended Syntax: REMOVE TABLE hbase_tablename [WHERE where_condition];
  Syntax Description: The statement is used to delete data that meets the criteria from a Hive on HBase table. For details, see "Component Operation Guide" > "Using Hive" > "Deleting Single-Row Records from Hive on HBase" in the MapReduce Service (MRS) 3.3.1-LTS Usage Guide (for Huawei Cloud Stack 8.3.1).
  Syntax Example: remove table hbase_table1 where id = 1;
  Example Description: The statement is used to delete data that meets the criterion of "id = 1" from the table.
- Extended Syntax: CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name (col_name data_type [COMMENT col_comment], ...) [ROW FORMAT row_format] STORED AS inputformat 'org.apache.hadoop.hive.contrib.fileformat.SpecifiedDelimiterInputFormat' outputformat 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
  Syntax Description: The statement is used to create a Hive table and specify that the table supports customized row delimiters. For details, see "Component Operation Guide" > "Using Hive" > "Customizing Row Separators" in the MapReduce Service (MRS) 3.3.1-LTS Usage Guide (for Huawei Cloud Stack 8.3.1).
  Syntax Example: create table blu(time string, num string, msg string) row format delimited fields terminated by ',' stored as inputformat 'org.apache.hadoop.hive.contrib.fileformat.SpecifiedDelimiterInputFormat' outputformat 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
  Example Description: The statement is used to create table blu and set inputformat to SpecifiedDelimiterInputFormat so that the row delimiter can be specified during the query.
Permission Management
Hive Permission
Hive is a data warehouse framework built on Hadoop. It provides basic data analysis services using the Hive query language (HQL), a language like the structured query language (SQL).
MRS supports users, user groups, and roles. Permissions must be assigned to roles and then roles are bound to users or user groups. Users can obtain permissions only by binding a role or joining a group that is bound with a role. For details about Hive authorization, visit https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Authorization.
- Hive permissions in security mode need to be managed whereas those in normal mode do not.
- If the current component uses Ranger for permission control, you need to configure permission management policies based on Ranger. For details, see Adding a Ranger Access Permission Policy for Hive.
Hive Permission Model
To use the Hive component, users must have permissions on Hive databases and tables (including external tables and views). In MRS, the complete Hive permission model is composed of Hive metadata permission and HDFS file permission. The Hive permission model also includes the permission to use databases or tables.
- Hive metadata permission
Similar to traditional relational databases, the Hive database of MRS supports the CREATE and SELECT permission, and the Hive tables and columns support the SELECT, INSERT, and DELETE permissions. Hive also supports the permissions of OWNERSHIP and Hive Admin Privilege.
- Hive data file permission, also known as HDFS file permission
Hive database and table files are stored in the HDFS. The created databases or tables are saved in the /user/hive/warehouse directory of the HDFS by default. The system automatically creates subdirectories named after database names and database table names. To access a database or a table, the corresponding file permissions (read, write, and execute) on the HDFS are required.
In the multi-instance scenario, the directory is /user/hiven (n = 1 to 4)/warehouse, where n indicates the Hive instance number.
To perform various operations on Hive databases or tables, you need to associate the metadata permission with the HDFS file permission. For example, to query Hive data tables, you need to associate the metadata permission SELECT and the HDFS file permissions Read and Write.
To use the role management function of Manager GUI to manage the permissions of Hive databases and tables, you only need to configure the metadata permission, and the system will automatically associate and configure the HDFS file permission. In this way, operations on the interface are simplified, and the efficiency is improved.
Hive Users
MRS provides users and roles to use Hive, such as creating tables, inserting data into tables, and querying tables. Hive defines the USER class, corresponding to user instances. Hive defines the GROUP class, corresponding to role instances.
You can use Manager to set permissions for Hive users. This method only supports permission setting in roles. A user or user group can obtain the permissions only after a role is bound to the user or user group. Hive users can be granted Hive administrator permissions and permissions to access databases, tables, and columns.
Support for Cascading Authorization
Hive tables in a cluster with Ranger authentication enabled support cascading authorization, which significantly improves authorization usability. You only need to grant permissions on service tables once on the Ranger page; the background then automatically associates the permissions of the data storage source in a fine-grained manner, without needing to know the storage path of the tables and without requiring secondary authorization. This also eliminates the drawbacks of authorization in decoupled storage and compute deployments. For details, see Hive Tables Supporting Cascading Authorization.
Hive Usage Scenarios and Related Permissions
Creating a database with Hive requires users to join in the hive group, without granting a role. Users have all permissions on the databases or tables created by themselves in Hive or HDFS. They can create tables, select, delete, insert, or update data, and grant permissions to other users to allow them to access the tables and corresponding HDFS directories and files.
A user can access the tables or database only with permissions. The permission required by users varies according to Hive usage scenarios.
Typical Scenario | Permission
---|---
Using Hive tables, columns, or databases | Permissions required in different scenarios are as follows:
Associating and using other components | In addition to Hive permissions, permissions of other components are required in some scenarios, for example:
In some special Hive usage scenarios, you need to configure other types of permission.
Scenario | Permission
---|---
Creating Hive databases, tables, and external tables, or adding partitions to created Hive tables or external tables when data files specified by Hive users are saved to HDFS directories other than /user/hive/warehouse | The directory must already exist, the Hive user must be the owner of the directory, and the Hive user must have the read, write, and execute permissions on the directory. The user must also have the read and write permissions on all upper-layer directories of the directory. After an administrator grants the Hive permission to the role, the HDFS permission is automatically granted.
Using load to load data from all the files or specified files in a specified directory to Hive tables as a Hive user | NOTE: When load is used to import data from a Linux local disk, the files must be located on the HiveServer node on which the command is run, and the file permissions must be modified. You are advised to run the command on a client. The HiveServer to which the client is connected can be found. For example, if the Hive client displays 0: jdbc:hive2://10.172.0.43:21066/>, the IP address of the connected HiveServer is 10.172.0.43.
Creating or deleting functions or modifying any database | The Hive Admin Privilege is required.
Performing operations on all databases and tables in Hive | The user must be added to the supergroup user group and granted Hive Admin Privilege.
Creating a Hive Role
Scenario
Create and configure a Hive role on Manager as an MRS cluster administrator. The Hive role can be granted the permissions of the Hive administrator and the permissions to operate Hive table data.
Creating a database with Hive requires users to join in the hive group, without granting a role. Users have all permissions on the databases or tables created by themselves in Hive or HDFS. They can create tables, select, delete, insert, or update data, and grant permissions to other users to allow them to access the tables and corresponding HDFS directories and files. The created databases or tables are saved in the /user/hive/warehouse directory of the HDFS by default.
- A Hive role can be created only in security mode.
- If the current component uses Ranger for permission control, you need to configure permission management policies based on Ranger. For details, see Adding a Ranger Access Permission Policy for Hive.
Prerequisites
- The MRS cluster administrator has understood service requirements.
- Log in to FusionInsight Manager. For details, see Logging In to FusionInsight Manager.
- The Hive client has been installed. For details, see Installing a Client.
Procedure
- Log in to FusionInsight Manager.
- Choose System > Permission > Role.
- Click Create Role, and set Role Name and Description.
- Set Configure Resource Permission. For details, see Table 2-806.
- Grant the read and execution permissions for the HDFS directory.
- Click Name of the desired cluster and select HDFS for Service Name. On the displayed page, click File System, choose hdfs://hacluster/ > user, locate the row where hive is located, and select Read and Execute in the Permission column.
- Click Name of the desired cluster and select HDFS for Service Name. On the displayed page, click File System, choose hdfs://hacluster/ > user > hive, locate the row where warehouse is located, and select Read and Execute in the Permission column.
- Click Name of the desired cluster and select HDFS for Service Name. On the displayed page, click File System, choose hdfs://hacluster/ > tmp, locate the row where hive-scratch is located, and select Read and Execute in the Permission column.
- Hive Admin Privilege: Hive administrator permission.
- Hive Read Write Privileges: Hive data table management permission, which is the operation permission to set and manage the data of created tables.
- Hive role management supports granting the administrator permission and the permissions to access databases, tables, and views.
- The permissions of the Hive administrator do not include the permission to manage HDFS.
- If there are too many tables in the database or too many files in tables, the permission granting may last a while. For example, if a table contains 10,000 files, the permission granting lasts about 2 minutes.
Table 2-806 Setting a role

- Task: Setting the Hive administrator permission
  Role Authorization: In the Configure Resource Permission table, choose Name of the desired cluster > Hive and select Hive Admin Privilege.
  NOTE: After being bound to the Hive administrator role, perform the following operations during each maintenance operation:
  - Log in to the node where the Hive client is installed as the client installation user.
  - Run the following command to configure environment variables:
    For example, if the Hive client installation directory is /opt/hiveclient, run source /opt/hiveclient/bigdata_env.
  - Run the following command to authenticate the user:
    kinit Hive service user
  - Run the following command to log in to the client tool:
    beeline
  - Run the following command to update the administrator permissions of the Hive user:
    set role admin;
- Task: Setting the permission to query a table of another user in the default database
  Role Authorization:
  - In the Configure Resource Permission table, choose Name of the desired cluster > Hive > Hive Read Write Privileges.
  - Click the name of the specified database in the database list. Tables in the database are displayed.
  - In the Permission column of the specified table, select SELECT.
- Task: Setting the permission to insert data into a table of another user in the default database
  Role Authorization:
  - In the Configure Resource Permission table, choose Name of the desired cluster > Hive > Hive Read Write Privileges.
  - Click the name of the specified database in the database list. Tables in the database are displayed.
  - In the Permission column of the specified table, select INSERT.
- Task: Setting the permission to import data to a table of another user in the default database
  Role Authorization:
  - In the Configure Resource Permission table, choose Name of the desired cluster > Hive > Hive Read Write Privileges.
  - Click the name of the specified database in the database list. Tables in the database are displayed.
  - In the Permission column of the specified table, select DELETE and INSERT.
- Task: Setting the permission to submit HQL commands to Yarn for execution
  Role Authorization: The HQL commands used by some services are converted into MapReduce tasks and submitted to Yarn for execution, so the Yarn permissions need to be set. This applies, for example, when the HQL statements use insert, count, distinct, group by, order by, sort by, or join.
  - In the Configure Resource Permission table, choose Name of the desired cluster > Yarn > Scheduling Queue > root.
  - In the Permission column of the default queue, select Submit.
- Click OK, and return to the Role page.
Configuring Permissions for Hive Tables, Columns, or Databases
Scenario
You can configure related permissions if you need to access tables or databases created by other users. Hive supports column-based permission control. If a user needs to access some columns in tables created by other users, the user must be granted the permission for columns. The following describes how to grant table, column, and database permissions to users by using the role management function of MRS Manager.
- You can configure permissions for Hive tables, columns, or databases only in security mode.
- If the current component uses Ranger for permission control, you need to configure permission management policies based on Ranger. For details, see Adding a Ranger Access Permission Policy for Hive.
Prerequisites
- You have obtained a user account with the administrator permissions, such as admin.
- You have created a role, for example, hrole, on Manager by referring to instructions in Creating a Hive Role. You do not need to set the Hive permission but need to set the permission to submit the HQL command to Yarn for execution.
- You have created two Hive human-machine users, such as huser1 and huser2, on Manager and added them to the hive group. huser2 has been bound to hrole. The hdb database has been created by user huser1, and the htable table has been created in the database.
Procedure
- Granting Table Permissions
Users have complete permission on the tables created by themselves in Hive and the HDFS. To access the tables created by others, they need to be granted the permission. After the Hive metadata permission is granted, the HDFS permission is automatically granted. The procedure for granting a role the permission of querying, inserting, and deleting htable data is as follows:
- On FusionInsight Manager, choose System > Permission > Role.
- Locate the row that contains hrole, and click Modify.
- Choose Name of the desired cluster > Hive > Hive Read Write Privileges.
- Click the name of the specified database hdb in the database list. Table htable in the database is displayed.
- In the Permission column of the htable table, select SELECT, INSERT, and DELETE.
- Click OK.
In role management, the procedure for granting a role the permission of querying, inserting, and deleting Hive external table data is the same. After the metadata permission is granted, the HDFS permission is automatically granted.
- Granting Column Permissions
Users have all permissions for the tables created by themselves in Hive and HDFS. Users do not have the permission to access the tables created by others. If a user needs to access some columns in tables created by other users, the user must be granted the permission for columns. After the Hive metadata permission is granted, the HDFS permission is automatically granted. The procedure for granting a role the permission of querying and inserting data in hcol of htable is as follows:
- On FusionInsight Manager, choose System > Permission > Role.
- Locate the row that contains hrole, and click Modify.
- Choose Name of the desired cluster > Hive > Hive Read Write Privileges.
- In the database list, click the specified database hdb to display the htable table in the database. Click the htable table to display the hcol column in the table.
- In the Permission column of the hcol column, select SELECT and INSERT.
- Click OK.
In role management, after the metadata permission is granted, the HDFS permission is automatically granted. Therefore, after the column permission is granted, the HDFS ACL permission for all files of the table is automatically granted.
- Granting Database Permissions
Users have complete permission on the databases created by themselves in Hive and the HDFS. To access the databases created by others, they need to be granted the permission. After the Hive metadata permission is granted, the HDFS permission is automatically granted. The procedure for granting a role the permission of querying data and creating tables in database hdb is as follows. Other types of database operation permission are not supported.
- On FusionInsight Manager, choose System > Permission > Role.
- Locate the row that contains hrole, and click Modify.
- Choose Name of the desired cluster > Hive > Hive Read Write Privileges.
- In the Permission column of the hdb database, select SELECT and CREATE.
- Click OK.
- Any permission for a table in the database is automatically associated with the HDFS permission for the database directory to facilitate permission management. When any permission for a table is canceled, the system does not automatically cancel the HDFS permission for the database directory to ensure performance. In this case, users can only log in to the database and view table names.
- When the query permission on a database is added to or deleted from a role, the query permission on tables in the database is automatically added to or deleted from the role.
- If the number of partitions in the database exceeds one million and all partitions are in the table directory, to accelerate granting permissions to the database, log in to FusionInsight Manager, click Cluster and choose Services > Hive. On the page that is displayed, click Configurations and then All Configurations. In the navigation pane on the left, choose MetaStore(Role) > Customization, add the hive-ext.skip.grant.partition parameter, and set it to true. After this parameter is added, partition scanning is skipped when permissions are granted to the database. You need to restart the MetaStore instance for the modification to take effect.
- If there are too many tables in a database, granting permissions to the database will take a long time. To accelerate the grant, log in to FusionInsight Manager, click Cluster and choose Services > Hive. On the page that is displayed, click the Configurations tab and then All Configurations. In the navigation pane on the left, choose MetaStore(Role) > Customization, add the metastore.dbpriv.allow.rollback parameter, and set it to true. If an exception occurs during the permission grant, a rollback will be performed.
Note: After adding a custom parameter, you need to restart the MetaStore instance. During the permission grant, table information in the database is cached, which occupies memory. Therefore, ensure that the MetaStore heap memory is sufficient.
Concepts
Scenario | Required Permission
---|---
DESCRIBE TABLE | SELECT
SHOW PARTITIONS | SELECT
ANALYZE TABLE | SELECT and INSERT
SHOW COLUMNS | SELECT
SHOW TABLE STATUS | SELECT
SHOW TABLE PROPERTIES | SELECT
SELECT | SELECT
EXPLAIN | SELECT
CREATE VIEW | SELECT, Grant Of Select, and CREATE
SHOW CREATE TABLE | SELECT and Grant Of Select
CREATE TABLE | CREATE
ALTER TABLE ADD PARTITION | INSERT
INSERT | INSERT
INSERT OVERWRITE | INSERT and DELETE
LOAD | INSERT and DELETE
ALTER TABLE DROP PARTITION | DELETE
CREATE FUNCTION | Hive Admin Privilege
DROP FUNCTION | Hive Admin Privilege
ALTER DATABASE | Hive Admin Privilege
Configuring Permissions to Use Other Components for Hive
Scenario
Hive may need to be associated with other components. For example, Yarn permissions are required in the scenario of using HQL statements to trigger MapReduce jobs, and HBase permissions are required in the Hive over HBase scenario. The following describes the operations in the two scenarios.
- In security mode, Yarn and HBase permission management is enabled by default. Therefore, Yarn and HBase permissions need to be configured by default.
- In common mode, Yarn and HBase permission management is disabled by default. That is, any user has permissions, so Yarn and HBase permissions do not need to be configured by default. If a user enables permission management by modifying the Yarn or HBase configuration, the Yarn and HBase permissions then need to be configured.
- If the current component uses Ranger for permission control, you need to configure permission management policies based on Ranger. For details, see Adding a Ranger Access Permission Policy for Hive.
Prerequisites
- The Hive client has been installed. For details, see Installing a Client. For example, the installation directory is /opt/hadoopclient.
- You have obtained a user account with the administrator permissions, such as admin.
Procedure
Association with Yarn
Yarn permissions are required when HQL statements, such as insert, count, distinct, group by, order by, sort by, and join, are used to trigger MapReduce jobs. The following uses the procedure for assigning a role the permissions to run the count statements in the thc table as an example.
- Create a role on FusionInsight Manager.
- In the Configure Resource Permission table, choose Name of the desired cluster > Yarn > Scheduler Queue > root.
- In the Permission column of the default queue, select Submit and click OK.
- In the Configure Resource Permission table, choose Name of the desired cluster > Hive > Hive Read Write Privileges > default. Select SELECT for table thc, and click OK.
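Once the Submit permission on the default queue and the SELECT permission on the thc table are bound to a user through the role, statements that trigger MapReduce jobs can be run on the Beeline client, for example (a sketch assuming the thc table described above exists in the default database):
select count(1) from thc;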
Hive over HBase Authorization
After the permissions are assigned, you can use HQL statements that are similar to SQL statements to access HBase tables from Hive. The following uses the procedure for assigning a user the rights to query HBase tables as an example.
- On the role management page of FusionInsight Manager, create an HBase role, for example, hive_hbase_create, and grant the permission to create HBase tables.
In the Configure Resource Permission table, choose Name of the desired cluster > HBase > HBase Scope > global. Select Create of the namespace default, and click OK.
- On FusionInsight Manager, create a human-machine user, for example, hbase_creates_user, add the user to the hive group, and bind the hive_hbase_create role to the user so that the user can create Hive and HBase tables.
- If the current component uses Ranger for permission control, grant the create permission for hive_hbase_create or hbase_creates_user. For details, see Adding a Ranger Access Permission Policy for Hive.
- Log in to the node where the client is installed as the client installation user.
- Run the following command to configure environment variables:
source /opt/hadoopclient/bigdata_env
- Run the following command to authenticate the user:
kinit hbase_creates_user
- Run the following command to go to the shell environment of the Hive client:
beeline
- Run the following command to create a table in Hive and HBase, for example, the thh table.
CREATE TABLE thh(id int, name string, country string) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES("hbase.columns.mapping" = "cf1:id,cf1:name,:key") TBLPROPERTIES ("hbase.table.name" = "thh");
The created Hive table and the HBase table are stored in the Hive database default and the HBase namespace default, respectively.
- On the role management page of FusionInsight Manager, create a role, for example, hive_hbase_select, and assign the role the permission to query the Hive table thh and the HBase table thh.
- In the Configure Resource Permission table, choose Name of the desired cluster > HBase > HBase Scope > global > default. Select read of the thh table, and click OK to grant the table query permission to the HBase role.
- Edit the role. In the Configure Resource Permission table, choose Name of the desired cluster > HBase > HBase Scope > global > hbase, select Execute for hbase:meta, and click OK.
- Edit the role. In the Configure Resource Permission table, choose Name of the desired cluster > Hive > Hive Read Write Privileges > default. Select SELECT for the thh table, and click OK.
- On FusionInsight Manager, create a human-machine user, for example, hbase_select_user, add the user to the hive group, and bind the hive_hbase_select role to the user so that the user can query Hive and HBase tables.
- Run the following command to configure environment variables:
source /opt/hadoopclient/bigdata_env
- Run the following command to authenticate users:
kinit hbase_select_user
- Run the following command to go to the shell environment of the Hive client:
beeline
- Run the following command to use an HQL statement to query HBase table data:
select * from thh;
Using a Hive Client
Scenario
This section guides users to use a Hive client in an O&M or service scenario. For details about shell commands, see MapReduce Service (MRS) 3.3.1-LTS Shell O&M Commands (for Huawei Cloud Stack 8.3.1).
Prerequisites
- The client has been installed. For details, see Installing a Client. For example, the client is installed in the /opt/hadoopclient directory. The client directory in the following operations is only an example. Change it to the actual installation directory.
- Service component users have been created by the MRS cluster administrator by referring to Creating a User. In security mode, machine-machine users need to download the keytab file. For details, see Exporting an Authentication Credential File. A human-machine user must change the password upon the first login.
Using the Hive Client
- Log in to the node where the client is installed as the client installation user.
- Run the following command to go to the client installation directory:
cd /opt/hadoopclient
- Run the following command to configure environment variables:
source bigdata_env
- If you use the client to connect to a specific Hive instance in a scenario where multiple Hive instances are installed, run the following command to load the environment variables of that instance. Otherwise, skip this step. For example, to load the environment variables of the Hive2 instance:
source Hive2/component_env
- Log in to the Hive client based on the cluster authentication mode.
- In security mode, run the following command to complete user authentication and log in to the Hive client:
kinit Component service user
beeline
- In common mode, run the following command to log in to the Hive client. If no component service user is specified, the current OS user is used to log in to the Hive client.
beeline -n component service user
- In security mode, run the following command to complete user authentication and log in to the Hive client:
- Run the following command to execute the HCatalog client command:
hcat -e "cmd"
cmd must be a Hive DDL statement, for example, hcat -e "show tables".
- To use the HCatalog client, choose More > Download Client on the service page to download the clients of all services. This restriction does not apply to the beeline client.
- Due to permission model incompatibility, tables created using the HCatalog client cannot be accessed on the HiveServer client. However, the tables can be accessed on the WebHCat client.
- If you use the HCatalog client in Normal mode, the system performs DDL commands using the current user who has logged in to the operating system.
- Exit the beeline client by running the !q command instead of by pressing Ctrl + C. Otherwise, the temporary files generated by the connection cannot be deleted and a large number of junk files will be generated as a result.
- If multiple statements need to be entered during the use of beeline clients, separate the statements from each other using semicolons (;) and set the value of entireLineAsCommand to false.
Setting method: If beeline has not been started, run the beeline --entireLineAsCommand=false command. If the beeline has been started, run the !set entireLineAsCommand false command.
After the setting, if a statement contains semicolons (;) that do not indicate the end of the statement, escape characters must be added, for example, select concat_ws('\;', collect_set(col1)) from tbl.
Common Hive Client Commands
The following table lists common Hive Beeline commands.
For more commands, see https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients#HiveServer2Clients-BeelineCommands.
Command | Description
---|---
set <key>=<value> | Sets the value of a specific configuration variable (key). NOTE: If the variable name is misspelled, Beeline does not display an error.
set | Prints the list of configuration variables overwritten by users or Hive.
set -v | Prints all configuration variables of Hadoop and Hive.
add FILE[S] <filepath> <filepath>*, add JAR[S] <filepath> <filepath>*, add ARCHIVE[S] <filepath> <filepath>* | Adds one or more files, JAR files, or ARCHIVE files to the resource list of the distributed cache.
add FILE[S] <ivyurl> <ivyurl>*, add JAR[S] <ivyurl> <ivyurl>*, add ARCHIVE[S] <ivyurl> <ivyurl>* | Adds one or more files, JAR files, or ARCHIVE files to the resource list of the distributed cache using an Ivy URL in the ivy://group:module:version?query_string format.
list FILE[S], list JAR[S], list ARCHIVE[S] | Lists the resources that have been added to the distributed cache.
list FILE[S] <filepath>*, list JAR[S] <filepath>*, list ARCHIVE[S] <filepath>* | Checks whether the given resources have been added to the distributed cache.
delete FILE[S] <filepath>*, delete JAR[S] <filepath>*, delete ARCHIVE[S] <filepath>* | Deletes resources from the distributed cache.
delete FILE[S] <ivyurl> <ivyurl>*, delete JAR[S] <ivyurl> <ivyurl>*, delete ARCHIVE[S] <ivyurl> <ivyurl>* | Deletes the resources added using <ivyurl> from the distributed cache.
reload | Enables HiveServer2 to detect changes (added, deleted, or updated JAR files) in the path specified by hive.reloadable.aux.jars.path without restarting HiveServer2.
dfs <dfs command> | Runs the specified dfs command.
<query string> | Executes the Hive query and prints the result to the standard output.
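For example, a session that registers a UDF JAR as a resource, verifies it, and then removes it might look as follows; the HDFS path and file name below are placeholders, not values from this guide:
add jar hdfs:///tmp/my_udf.jar;
list jars;
delete jar hdfs:///tmp/my_udf.jar;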
Using HDFS Colocation to Store Hive Tables
Scenario
HDFS Colocation is the data location control function provided by HDFS. The HDFS Colocation API stores associated data, or data on which associated operations are performed, on the same storage node. Hive supports the HDFS Colocation function: when a Hive table is created with locator information set for its table files, the data files of related tables are stored on the same storage node when data is inserted using the insert statement (other data import modes are not supported). This ensures convenient and efficient data computing among associated tables. Only the TextFile and RCFile table formats are supported.
Procedure
- Log in to the node where the client is installed as a client installation user.
- Run the following command to switch to the client installation directory, for example, /opt/hadoopclient:
cd /opt/hadoopclient
- Run the following command to configure environment variables:
source bigdata_env
- If the cluster is in security mode, run the following command to authenticate the user:
kinit MRS username
- Create the groupid through the HDFS API.
hdfs colocationadmin -createGroup -groupId <groupid> -locatorIds <locatorid1>,<locatorid2>,<locatorid3>
In the preceding command, <groupid> indicates the name of the created group. The group created in this example contains three locators. You can define the number of locators as required.
For details about group ID creation and HDFS Colocation, see HDFS description.
- Run the following command to log in to the Hive client:
beeline
- Enable Hive to use colocation.
Assume that table_name1 and table_name2 are associated with each other. Run the following statements to create them:
CREATE TABLE <[db_name.]table_name1>[(col_name data_type , ...)] [ROW FORMAT <row_format>] [STORED AS <file_format>] TBLPROPERTIES("groupId"=" <group> ","locatorId"="<locator1>");
CREATE TABLE <[db_name.]table_name2> [(col_name data_type , ...)] [ROW FORMAT <row_format>] [STORED AS <file_format>] TBLPROPERTIES("groupId"=" <group> ","locatorId"="<locator1>");
After data is inserted into table_name1 and table_name2 using the insert statement, data files of table_name1 and table_name2 are distributed to the same storage position in the HDFS, facilitating associated operations among the two tables.
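The following is a minimal end-to-end sketch with hypothetical names (group01, locator1 to locator3, and the two order tables are examples only; remember that only the TextFile and RCFile formats are supported):
hdfs colocationadmin -createGroup -groupId group01 -locatorIds locator1,locator2,locator3
CREATE TABLE order_hdr(id INT, amount DOUBLE) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE TBLPROPERTIES("groupId"="group01","locatorId"="locator1");
CREATE TABLE order_dtl(id INT, item STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE TBLPROPERTIES("groupId"="group01","locatorId"="locator1");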
Using the Hive Column Encryption Function
Scenario
Hive supports encryption of one or multiple columns in a table. When creating a Hive table, you can specify the column to be encrypted and encryption algorithm. When data is inserted into the table using the insert statement, the related columns are encrypted. Column encryption can be performed in HDFS tables of only the TextFile and SequenceFile file formats. The Hive column encryption does not support views and the Hive over HBase scenario.
Hive supports two column encryption algorithms, which can be specified during table creation:
- AES (the encryption class is org.apache.hadoop.hive.serde2.AESRewriter)
- SMS4 (the encryption class is org.apache.hadoop.hive.serde2.SMS4Rewriter)
- In national cryptographic cluster scenarios, Hive column encryption supports only table creation using the SMS4 algorithm.
- When you import data from a common Hive table into a Hive column encryption table, you are advised to delete the original data from the common Hive table as long as doing this does not affect other services. Retaining an unencrypted table poses security risks.
Procedure
- Specify the column to be encrypted and encryption algorithm when creating a table.
create table<[db_name.]table_name> (<col_name1> <data_type> ,<col_name2> <data_type>,<col_name3> <data_type>,<col_name4> <data_type>) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' WITH SERDEPROPERTIES ('column.encode.columns'='<col_name2>,<col_name3>', 'column.encode.classname'='org.apache.hadoop.hive.serde2.AESRewriter')STORED AS TEXTFILE;
Alternatively, use the following statement:
create table <[db_name.]table_name> (<col_name1> <data_type> ,<col_name2> <data_type>,<col_name3> <data_type>,<col_name4> <data_type>) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' WITH SERDEPROPERTIES ('column.encode.indices'='1,2', 'column.encode.classname'='org.apache.hadoop.hive.serde2.SMS4Rewriter') STORED AS TEXTFILE;
- The numbers used to specify encryption columns start from 0. 0 indicates column 1, 1 indicates column 2, and so on.
- When creating a table with encrypted columns, ensure that the directory where the table resides is empty.
- Insert data into the table using the insert statement.
Assume that the test table exists and contains data.
insert into table <table_name> select <col_list> from test;
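For example, a hypothetical table that encrypts its phone and address columns (columns 2 and 3, counting from 0) with the AES class could be created and populated as follows; the table and column names are placeholders, and test is the existing source table mentioned above:
create table user_enc(id string, name string, phone string, address string) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' WITH SERDEPROPERTIES ('column.encode.indices'='2,3', 'column.encode.classname'='org.apache.hadoop.hive.serde2.AESRewriter') STORED AS TEXTFILE;
insert into table user_enc select id,name,phone,address from test;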
Customizing Row Separators
Scenario
In most cases, a carriage return character is used as the row delimiter in Hive tables stored in text files; that is, the carriage return character is used as the terminator of a row during queries. However, some data files are delimited by special characters rather than a carriage return character.
MRS Hive allows you to use different characters or character combinations as the row delimiter of Hive text data. When creating a table, set inputformat to SpecifiedDelimiterInputFormat, and set the following parameter before each query. The table data is then queried using the specified delimiter.
set hive.textinput.record.delimiter='';
The Hue component of the current version does not support the configuration of multiple separators when files are imported to a Hive table.
Procedure
- Specify inputFormat and outputFormat when creating a table.
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name [(col_name data_type [COMMENT col_comment], ...)] [ROW FORMAT row_format] STORED AS inputformat 'org.apache.hadoop.hive.contrib.fileformat.SpecifiedDelimiterInputFormat' outputformat 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
- Specify the delimiter before the query.
set hive.textinput.record.delimiter='!@!';
Hive will use '!@!' as the row delimiter.
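Putting the two steps together, a hypothetical session that queries the blu table from the Hive SQL example (whose rows are delimited by '!@!') would look as follows:
set hive.textinput.record.delimiter='!@!';
select * from blu;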
Deleting Single-Row Records from Hive on HBase
Scenario
Due to the limitations of the underlying storage system, Hive does not support deleting individual rows of table data. However, in the Hive on HBase scenario, MRS Hive supports deleting individual rows of HBase table data: using a specific syntax, Hive can delete one or more rows from an HBase table.
Cluster Authentication Mode | Required Permission
---|---
Security mode | SELECT, INSERT, and DELETE
Common mode | None
Procedure
- To delete some data from an HBase table, run the following HQL statement:
remove table <table_name> where <expression>;
In the preceding information, <expression> specifies the filter condition of the data to be deleted. <table_name> indicates the Hive on HBase table from which data is to be deleted.
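For example, to delete the record whose id is 12005000201 from a hypothetical Hive on HBase table named user_hbase (assuming the table exists and, in security mode, the permissions listed above have been granted):
remove table user_hbase where id='12005000201';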
Configuring HTTPS/HTTP-based REST APIs
Scenario
WebHCat provides external REST APIs for Hive. By default, the open-source community version uses the HTTP protocol.
MRS Hive supports the HTTPS protocol that is more secure, and enables switchover between the HTTP protocol and the HTTPS protocol.
The security mode supports HTTPS and HTTP, and the common mode supports only HTTP.
Procedure
- Log in to FusionInsight Manager. For details, see Accessing FusionInsight Manager of an MRS Cluster. Click Cluster, choose Services > Hive, click Configurations, and then All Configurations.
- Locate the protocol configuration item, select HTTPS or HTTP, and restart the Hive service after the modification to use the corresponding protocol.
Enabling or Disabling the Transform Function
Scenario
The Transform function is not allowed by Hive of the open source version.
MRS Hive supports the configuration of the Transform function. The function is disabled by default, which is the same as that of the open-source community version.
Users can modify configurations of the Transform function to enable the function. However, security risks exist when the Transform function is enabled.
The Transform function can be disabled only in security mode.
Procedure
- Log in to FusionInsight Manager. For details, see Accessing FusionInsight Manager of an MRS Cluster. Click Cluster, choose Services > Hive, click Configurations, and then All Configurations.
- Enter the parameter name in the search box, search for hive.security.transform.disallow, change the parameter value to true or false, and restart all HiveServer instances.
- If this parameter is set to true, the Transform function is disabled, which is the same as that in the open-source community version.
- If this parameter is set to false, the Transform function is enabled, which poses security risks.
Access Control of a Dynamic Table View on Hive
Scenario
This section describes how to create a view on Hive when MRS is configured in security mode, authorize access permissions to different users, and specify that different users access different data.
In the view, Hive can obtain the built-in function current_user() of the users who submit tasks on the client and filter the users. This way, authorized users can only access specific data in the view.
- In normal mode, the current_user() function cannot distinguish users who submit tasks on the client. Therefore, the access control function takes effect only for Hive in security mode.
- If the current_user() function is used in the actual service logic, the possible risks must be fully evaluated during the conversion between the security mode and normal mode.
Example
- If the current_user() function is not used, different views need to be created for different users to access different data.
- Authorize the view v1 permission to user hiveuser1. The user hiveuser1 can access data with type set to hiveuser1 in table1.
create view v1 as select * from table1 where type='hiveuser1';
- Authorize the view v2 permission to user hiveuser2. The user hiveuser2 can access data with type set to hiveuser2 in table1.
create view v2 as select * from table1 where type='hiveuser2';
- Authorize the view v1 permission to user hiveuser1. The user hiveuser1 can access data with type set to hiveuser1 in table1.
- If the current_user function is used, only one view needs to be created.
Authorize the view v permission to users hiveuser1 and hiveuser2. When user hiveuser1 queries view v, the current_user() function is automatically converted to hiveuser1. When user hiveuser2 queries view v, the current_user() function is automatically converted to hiveuser2.
create view v as select * from table1 where type=current_user();
Specifying Whether the ADMIN Permission Is Required for Creating Temporary Functions
Scenario
You must have ADMIN permission when creating temporary functions on Hive of the open source community version.
MRS Hive allows you to configure whether the ADMIN permission is required for creating temporary functions. By default, the ADMIN permission is required, which is the same as the open-source community version.
You can modify the configuration so that temporary functions can be created without the ADMIN permission. However, this poses security risks. A usage example is provided after the following procedure.
The security mode supports the configuration of whether the ADMIN permission is required for creating temporary functions, but the common mode does not support this function.
Procedure
- Log in to FusionInsight Manager. For details, see Accessing FusionInsight Manager of an MRS Cluster. Click Cluster, choose Services > Hive, click Configurations, and then All Configurations.
- Enter the parameter name in the search box, search for hive.security.temporary.function.need.admin, change the parameter value to true or false, and restart all HiveServer instances.
- If this parameter is set to true, the ADMIN permission is required for creating temporary functions, which is the same as that in the open source community.
- If this parameter is set to false, the ADMIN permission is not required for creating temporary functions.
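For example, after hive.security.temporary.function.need.admin is set to false, a common user can create a temporary function directly on the beeline client. The function name, class name, and JAR path below are assumptions for illustration:
create temporary function my_udf as 'com.example.udf.MyUDF' using jar 'hdfs://hacluster/tmp/my_udf.jar';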
Using Hive to Read Data in a Relational Database
Scenario
Hive allows users to create external tables to associate with other relational databases. External tables read data from the associated relational databases and support join operations with other Hive tables. The following relational databases are supported:
- DB2
- Oracle
Prerequisites
The Hive client has been installed. For details, see Installing a Client.
Procedure
- Log in to the node where the Hive client is installed as the Hive client installation user.
- Run the following command to go to the client installation directory:
cd Client installation directory
For example, if the client installation directory is /opt/hadoopclient, run the following command:
cd /opt/hadoopclient
- Run the following command to configure environment variables:
source bigdata_env
- Check whether the cluster authentication mode is Security.
- If yes, run the following command to authenticate the user:
kinit Hive service user
- If no, go to Step 5.
- Run the following command to upload the driver JAR package of the relational database to be associated to an HDFS directory.
hdfs dfs -put directory where the JAR package is located HDFS directory to which the JAR is uploaded
For example, to upload the Oracle driver JAR package in /opt to the /tmp directory in HDFS, run the following command:
hdfs dfs -put /opt/ojdbc6.jar /tmp
- Create an external table on the Hive client to associate with the relational database, as shown in the following example. If the security mode is used, the user who creates the table must have the ADMIN permission. The ADD JAR path is subject to the actual path.
-- Example of associating with an Oracle Linux 6 database
-- In security mode, set the admin permission.
set role admin;
-- Upload the driver JAR package of the relational database to be associated. The driver JAR packages vary according to databases.
ADD JAR hdfs:///tmp/ojdbc6.jar;
CREATE EXTERNAL TABLE ora_test
-- The Hive table must have one more column than the database return result. This column is used for paging query.
(id STRING,rownum string)
STORED BY 'com.qubitproducts.hive.storage.jdbc.JdbcStorageHandler'
TBLPROPERTIES (
-- Relational database table type
"qubit.sql.database.type" = "ORACLE",
-- Connect to the URL of the relational database through JDBC. (The URL formats vary according to databases.)
"qubit.sql.jdbc.url" = "jdbc:oracle:thin:@//10.163.0.1:1521/mydb",
-- Relational database driver class type
"qubit.sql.jdbc.driver" = "oracle.jdbc.OracleDriver",
-- SQL statement queried in the relational database. The result is returned to the Hive table.
"qubit.sql.query" = "select name from aaa",
-- (Optional) Match the Hive table columns to the relational database table columns.
"qubit.sql.column.mapping" = "id=name",
-- Relational database user
"qubit.sql.dbcp.username" = "test",
-- Relational database password. Commands carrying authentication passwords pose security risks. Disable historical command recording before running such commands to prevent information leakage.
"qubit.sql.dbcp.password" = "xxx");
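After the external table is created, you can query it like a common Hive table or join it with other Hive tables. The following statements are examples only; table1 and its name column are assumptions:
select * from ora_test;
select o.id from ora_test o join table1 t on o.id = t.name;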
Supporting Traditional Relational Database Syntax in Hive
Overview
Hive supports the following types of traditional relational database syntax:
- Grouping
- EXCEPT and INTERSECT
Grouping
Syntax description:
- Grouping takes effect only when the Group by statement contains ROLLUP or CUBE.
- The result set generated by CUBE contains all the combinations of values in the selected columns.
- The result set generated by ROLLUP contains the combinations of a certain layer structure in the selected columns.
- Grouping: If a row is added by using the CUBE or ROLLUP operator, the output value of the added row is 1. If the row is not added by using the CUBE or ROLLUP operator, the output value of the added row is 0.
For example, the table_test table exists in Hive and the table structure is as follows:
+----------------+-------------------+--+
| table_test.id  | table_test.value  |
+----------------+-------------------+--+
| 1              | 10                |
| 1              | 15                |
| 2              | 20                |
| 2              | 5                 |
| 2              | 13                |
+----------------+-------------------+--+
Run the following statement:
select id,grouping(id),sum(value) from table_test group by id with rollup;
The result is as follows:
+-------+-----------------+------+--+
| id    | groupingresult  | sum  |
+-------+-----------------+------+--+
| 1     | 0               | 25   |
| NULL  | 1               | 63   |
| 2     | 0               | 38   |
+-------+-----------------+------+--+
EXCEPT and INTERSECT
Syntax description:
- EXCEPT returns the difference between two result sets, that is, the distinct rows returned by the first query but not by the second query.
- INTERSECT returns the intersection of two result sets, that is, the distinct rows returned by both queries.
For example, two tables test_table1 and test_table2 exist in Hive.
The table structure of test_table1 is as follows:
+-----------------+--+
| test_table1.id  |
+-----------------+--+
| 1               |
| 2               |
| 3               |
| 4               |
+-----------------+--+
The table structure of test_table2 is as follows:
+-----------------+--+
| test_table2.id  |
+-----------------+--+
| 2               |
| 3               |
| 4               |
| 5               |
+-----------------+--+
- Run the following EXCEPT statement:
select id from test_table1 except select id from test_table2;
The result is as follows:
+--------------+--+
| _alias_0.id  |
+--------------+--+
| 1            |
+--------------+--+
- Run the following INTERSECT statement:
select id from test_table1 intersect select id from test_table2;
The result is as follows:
+--------------+--+
| _alias_0.id  |
+--------------+--+
| 2            |
| 3            |
| 4            |
+--------------+--+
Creating User-Defined Hive Functions
When the built-in functions of Hive cannot meet requirements, you can compile user-defined functions (UDFs) and use them in queries.
According to implementation methods, UDFs are classified as follows:
- Common UDFs: used to perform operations on a single data row and export a single data row.
- User-defined aggregating functions (UDAFs): used to input multiple data rows and export a single data row.
- User-defined table-generating functions (UDTFs): used to perform operations on a single data row and export multiple data rows.
According to use methods, UDFs are classified as follows:
- Temporary functions: used only in the current session and must be recreated after a session restarts.
- Permanent functions: used in multiple sessions. You do not need to create them every time a session restarts.
You need to properly control the memory and thread usage of variables in UDFs. Improper control may cause memory overflow or high CPU usage.
The following uses AddDoublesUDF as an example to describe how to compile and use UDFs.
Function
AddDoublesUDF is used to add two or more floating point numbers. In this example, you can learn how to write and use UDFs.
- A common UDF must inherit from org.apache.hadoop.hive.ql.exec.UDF.
- A common UDF must implement at least one evaluate() method. The evaluate method supports overloading.
- To develop a UDF, add the hive-exec-*.jar dependency package to the project. You can obtain the package from the Hive service installation directory, for example, ${BIGDATA_HOME}/components/FusionInsight_HD_*/Hive/disaster/plugin/lib/.
Sample Code
The following is a UDF code example:
xxx indicates the name of the organization that develops the program.
package com.xxx.bigdata.hive.example.udf;

import org.apache.hadoop.hive.ql.exec.UDF;

public class AddDoublesUDF extends UDF {
    public Double evaluate(Double... a) {
        Double total = 0.0;
        // Processing logic
        for (int i = 0; i < a.length; i++) {
            if (a[i] != null) {
                total += a[i];
            }
        }
        return total;
    }
}
How to Use
- Pack the program into AddDoublesUDF.jar on the client node and upload the JAR package to a specified HDFS directory, for example, /user/hive_examples_jars.
Both the user who creates the function and the user who uses the function must have the read permission on the file.
The following are example statements:
hdfs dfs -put ./hive_examples_jars /user/hive_examples_jars
hdfs dfs -chmod 777 /user/hive_examples_jars
- Check the cluster authentication mode.
- In security mode, log in to the beeline client as a user with the Hive management permission and run the following commands:
kinit Hive service user
beeline
set role admin;
- In common mode, run the following command:
beeline -n Hive service user
- Define the function in HiveServer. Run the following SQL statement to create a permanent function:
CREATE FUNCTION addDoubles AS 'com.xxx.bigdata.hive.example.udf.AddDoublesUDF' using jar 'hdfs://hacluster/user/hive_examples_jars/AddDoublesUDF.jar';
addDoubles is the alias of the function, which is used in SELECT queries. xxx is typically the name of the organization that develops the program.
Run the following statement to create a temporary function:
CREATE TEMPORARY FUNCTION addDoubles AS 'com.xxx.bigdata.hive.example.udf.AddDoublesUDF' using jar 'hdfs://hacluster/user/hive_examples_jars/AddDoublesUDF.jar';
- addDoubles indicates the function alias that is used for SELECT query.
- TEMPORARY indicates that the function is used only in the current session with the HiveServer.
- Run the following SQL statement to use the function on the HiveServer:
SELECT addDoubles(1,2,3);
If an [Error 10011] error is displayed when you log in to the client again, run the reload function; command and then use this function.
- Run the following SQL statement to delete the function from the HiveServer:
DROP FUNCTION addDoubles;
Extended Applications
None
Enhancing beeline Reliability
Scenario
- If the beeline client is disconnected due to a network exception during the execution of a batch processing task, the tasks submitted before the disconnection are still executed in Hive. When you start the same batch processing task again, the tasks that have been submitted are not executed again, and the remaining tasks are executed in sequence.
- If the HiveServer service breaks down during the execution of a batch processing task, Hive ensures that the tasks that have been successfully executed are not executed again when the same batch processing task is started. The execution starts from the first task that had not been executed when HiveServer2 broke down.
Example
- Beeline is reconnected after being disconnected.
Example:
beeline -e "${SQL}" --hivevar batchid=xxxxx
- Beeline kills the running tasks.
Example:
beeline -e "" --hivevar batchid=xxxxx --hivevar kill=true
- Log in to the beeline client and start the mechanism of reconnection after disconnection.
Log in to the beeline client and run the set hivevar:batchid=xxxx command.
Instructions:
- xxxx indicates the batch ID of tasks submitted in the same batch using the beeline client. Batch IDs can be used to identify the task submission batch. If the batch ID is not contained when a task is submitted, this feature is not enabled. The value of xxxx is specified during task execution. In the following example, the value of xxxx is 012345678901.
beeline -f hdfs://hacluster/user/hive/table.sql --hivevar batchid=012345678901
- If the running SQL script depends on the data timeliness, you are advised not to enable the breakpoint reconnection mechanism. You can use a new batch ID to submit tasks. During reexecution of the scripts, some SQL statements have been executed and are not executed again. As a result, expired data is obtained.
- If some built-in time functions are used in the SQL script, it is recommended that you do not enable the breakpoint reconnection mechanism or the use of a new batch ID for each execution. The reason is the same as above.
- An SQL script contains one or more subtasks. If the script contains logic for deleting and creating temporary tables, place the logic for deleting the temporary tables at the end of the script. If a subtask that runs after the temporary table deletion fails and an earlier subtask uses the temporary table, re-running the script with the same batch ID causes the compilation of those earlier subtasks to fail, because the temporary table has already been deleted (the table creation task is skipped since it has been completed, and only compilation is performed). In this case, you are advised to use a new batch ID to execute the script.
Parameter description:
- zk.cleanup.finished.job.interval: indicates the interval for executing the cleanup task. The default interval is 60 seconds.
- zk.cleanup.finished.job.outdated.threshold: indicates the threshold of the node validity period. A node is generated for tasks in the same batch. The threshold is calculated from the end time of the execution of the current batch task. If the time exceeds 60 minutes, the node is deleted.
- batch.job.max.retry.count: indicates the maximum number of retry times of a batch task. If the number of retry times of a batch task exceeds the value of this parameter, the task execution record is deleted. The task will be executed from the first task when the task is started next time. The default value is 10.
- beeline.reconnect.zk.path: indicates the root node for storing task execution progress. The default value for the Hive service is /beeline.
Viewing Table Structures Using the show create Statement as Users with the select Permission
Scenario
This function is supported on Hive and Spark.
With this function enabled, if the select permission is granted to a user during Hive table creation, the user can run the show create table command to view the table structure.
Procedure
- Log in to FusionInsight Manager. For details, see Accessing FusionInsight Manager of an MRS Cluster. Click Cluster, choose Services > Hive, click Configurations, and then All Configurations.
- Choose HiveServer(Role) > Customization, add a customized parameter to the hive-site.xml parameter file, set Name to hive.allow.show.create.table.in.select.nogrant, and set Value to true. Restart all Hive instances after the modification.
- Determine whether to enable this function on the Spark client.
- If yes, download and install the Spark client again.
- If no, no further action is required.
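For example, after the function is enabled, a user who has only the select permission on a table can view its structure on the beeline client. The table name is an assumption for illustration:
show create table test_table;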
Writing a Directory into Hive with the Old Data Removed to the Recycle Bin
Scenario
This function applies to Hive.
After this function is enabled, run the following command to write a directory into Hive: insert overwrite directory "/path1" .... After the operation is successfully performed, the old data is removed to the recycle bin, and the directory cannot be an existing database path in the Hive metastore.
Procedure
- Log in to FusionInsight Manager. For details, see Accessing FusionInsight Manager of an MRS Cluster. Click Cluster, choose Services > Hive, click Configurations, and then All Configurations.
- Choose HiveServer(Role) > Customization, add a customized parameter to the hive-site.xml parameter file, set Name to hive.overwrite.directory.move.trash, and set Value to true. Restart all Hive instances after the modification.
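For example, after the function is enabled, a statement similar to the following writes query results to a directory, and the old data in the directory is moved to the recycle bin. The directory and table name are assumptions for illustration:
insert overwrite directory '/user/test/output' row format delimited fields terminated by ',' select * from table1;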
Inserting Data to a Directory That Does Not Exist
Scenario
This function applies to Hive.
With this function enabled, when you run the insert overwrite directory /path1/path2/path3... command to write data to a subdirectory, if the permission of the /path1/path2 directory is 700 and its owner is the current user, the /path3 directory is automatically created if it does not exist, and data is written successfully.
This function is supported when hive.server2.enable.doAs is set to true in earlier versions. This version supports the function when hive.server2.enable.doAs is set to false.
The parameter adjustment of this function is the same as that of the custom parameters added in Writing a Directory into Hive with the Old Data Removed to the Recycle Bin.
Procedure
- Log in to FusionInsight Manager. For details, see Accessing FusionInsight Manager of an MRS Cluster. Click Cluster, choose Services > Hive, click Configurations, and then All Configurations.
- Choose HiveServer(Role) > Customization, add a customized parameter to the hive-site.xml parameter file, set Name to hive.overwrite.directory.move.trash, and set Value to true. Restart all Hive instances after the modification.
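For example, if the /user/test/path1/path2 directory has the permission 700 and is owned by the current user, the following statement automatically creates the path3 subdirectory and writes data to it. The paths and table name are assumptions for illustration:
insert overwrite directory '/user/test/path1/path2/path3' select * from table1;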
Creating Databases and Creating Tables in the Default Database Only as the Hive Administrator
Scenario
This function is supported on Hive and Spark.
After this function is enabled, only the Hive administrator can create databases and tables in the default database. Other users can use the databases only after being authorized by the Hive administrator.
- After this function is enabled, common users are not allowed to create a database or create a table in the default database. Based on the actual application scenario, determine whether to enable this function.
- This function restricts the permissions of common users. If common users have been used to perform operations such as database creation, table script migration, and metadata re-creation on a database of an earlier version, the users can still perform these operations after the database is migrated or the cluster is upgraded by temporarily disabling this function.
Procedure
- Log in to FusionInsight Manager. For details, see Accessing FusionInsight Manager of an MRS Cluster. Click Cluster, choose Services > Hive, click Configurations, and then All Configurations.
- Choose HiveServer(Role) > Customization, add a customized parameter to the hive-site.xml parameter file, set Name to hive.allow.only.admin.create, and set Value to true. Restart all Hive instances after the modification.
- Determine whether to enable this function on the Spark client.
- If yes, go to Step 4.
- If no, no further action is required.
- Choose SparkResource > Customization, add a customized parameter to the hive-site.xml parameter file, and set Name to hive.allow.only.admin.create and Value to true. Then, choose JDBCServer > Customization and repeat the preceding operations to add the customized parameter. Restart all Spark instances after the modification.
- Download and install the Spark client again.
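For example, after the function is enabled, the following statements succeed only when they are executed by the Hive administrator (set role admin is required in security mode). The database and table names are assumptions for illustration:
set role admin;
create database db_test;
create table default.tbl_test(id int);
The same statements executed by a common user are rejected.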
Disabling of Specifying the location Keyword When Creating an Internal Hive Table
Scenario
This function applies to Hive and Spark.
After this function is enabled, the location keyword cannot be specified when a Hive internal table is created. The table path is created in the default /warehouse directory and cannot point to another directory. If location is specified when an internal table is created, the creation fails.
This function restricts table creation statements. If existing tables in the database are not stored in the default /warehouse directory, you can temporarily disable this function so that the location keyword can still be specified during database creation, table script migration, or metadata re-creation.
Procedure
- Log in to FusionInsight Manager. For details, see Accessing FusionInsight Manager of an MRS Cluster. Click Cluster, choose Services > Hive, click Configurations, and then All Configurations.
- Choose HiveServer(Role) > Customization, add a customized parameter to the hive-site.xml parameter file, set Name to hive.internaltable.notallowlocation, and set Value to true. Restart all Hive instances after the modification.
- Determine whether to enable this function on the Spark client.
- If yes, download and install the Spark client again.
- If no, no further action is required.
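For example, after the function is enabled, the first statement below succeeds and the table path is created in the default /warehouse directory, whereas the second statement, which specifies location, fails. The table names and path are assumptions for illustration:
create table tbl_internal1(id int);
create table tbl_internal2(id int) location '/user/hive/mydir';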
Enabling the Function of Creating a Foreign Table in a Directory That Can Only Be Read
Scenario
This function applies to Hive and Spark.
After this function is enabled, a user or user group that has the read and execute permissions on a directory can create foreign tables in the directory, without checking whether the current user is the owner of the directory. The directory of a foreign table cannot be the default /warehouse directory. In addition, do not change the permissions of the directory during foreign table authorization.
After this function is enabled, the function of the foreign table changes greatly. Based on the actual application scenario, determine whether to enable this function.
Procedure
- Log in to FusionInsight Manager. For details, see Accessing FusionInsight Manager of an MRS Cluster. Click Cluster, choose Services > Hive, click Configurations, and then All Configurations.
- Choose HiveServer(Role) > Customization, add a customized parameter to the hive-site.xml parameter file, set Name to hive.restrict.create.grant.external.table, and set Value to true.
- Choose MetaStore(Role) > Customization, add a customized parameter to the hivemetastore-site.xml parameter file, set Name to hive.restrict.create.grant.external.table, and set Value to true. Restart all Hive instances after the modification.
- Determine whether to enable this function on the Spark client.
- If yes, download and install the Spark client again.
- If no, no further action is required.
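For example, after the function is enabled, a user who has the read and execute permissions on the /user/data/ext_dir directory can create a foreign table in it even if the user is not the directory owner. The table name and path are assumptions for illustration:
create external table ext_tbl(id int) location '/user/data/ext_dir';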
Authorizing Over 32 Roles in Hive
Scenario
This function applies to Hive.
The number of OS user groups is limited, and the number of roles that can be created in Hive cannot exceed 32. After this function is enabled, more than 32 roles can be created in Hive.
- After this function is enabled and the table or database is authorized, roles that have the same permission on the table or database will be combined using vertical bars (|). When the ACL permission is queried, the combined result is displayed, which is different from that before the function is enabled. This operation is irreversible. Determine whether to make adjustment based on the actual application scenario.
- If the current component uses Ranger for permission control, you need to configure related policies based on Ranger for permission management. For details, see Adding a Ranger Access Permission Policy for Hive.
- After this function is enabled, a maximum of 512 roles (including owner) are supported by default. The number is controlled by the user-defined parameter hive.supports.roles.max of MetaStore. You can change the value based on the actual application scenario.
Procedure
- Log in to FusionInsight Manager. For details, see Accessing FusionInsight Manager of an MRS Cluster. Click Cluster, choose Services > Hive, click Configurations, and then All Configurations.
- Choose MetaStore(Role) > Customization, add a custom parameter to the hivemetastore-site.xml parameter file, and set Name to hive.supports.over.32.roles and Value to true. Restart all MetaStore instances after the modification.
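For example, after the function is enabled, more than 32 roles can be created and granted on the beeline client. The role, user, and table names are assumptions for illustration:
create role role_test;
grant role_test to user hiveuser1;
grant select on table table1 to role role_test;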
Restricting the Maximum Number of Maps for Hive Tasks
Scenario
- This function applies to Hive.
- This function is used to limit the maximum number of maps for Hive tasks on the server to avoid performance deterioration caused by overload of the HiveServer service.
Procedure
- Log in to FusionInsight Manager. For details, see Accessing FusionInsight Manager of an MRS Cluster. Click Cluster, choose Services > Hive, click Configurations, and then All Configurations.
- Choose MetaStore(Role) > Customization, add a customized parameter to the hivemetastore-site.xml parameter file, set Name to hive.mapreduce.per.task.max.splits, and set the parameter to a large value. Restart all Hive instances after the modification.
HiveServer Lease Isolation
Scenario
- This function applies to Hive.
- This function can be enabled to specify specific users to access HiveServer services on specific nodes, achieving HiveServer resource isolation.
Procedure
This section describes how to set lease isolation for user hiveuser for existing HiveServer instances.
- Log in to FusionInsight Manager.
- Choose Cluster > Services > Hive and click HiveServer.
- In the HiveServer list, select the HiveServer for which lease isolation is configured and choose HiveServer > Instance Configurations > All Configurations.
- In the upper right corner of the All Configurations page, search for hive.server2.zookeeper.namespace and specify its value, for example, hiveserver2_zk.
- Click Save. In the dialog box that is displayed, click OK.
- Choose Cluster > Services > Hive. Click More and select Restart Service. In the dialog box displayed, enter the password to restart the service.
- Run the beeline -u command to log in to the client and run the following command:
beeline -u "jdbc:hive2://10.5.159.13:24002/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2_zk;sasl.qop=auth-conf;auth=KERBEROS;principal=hive/hadoop.<System domain name>@<System domain name>"
Replace 10.5.159.13 with the IP address of any ZooKeeper instance. To query the IP address, choose Cluster > Services > ZooKeeper and click Instance.
hiveserver2_zk following zooKeeperNamespace= is set to the value of hive.server2.zookeeper.namespace in Step 4.
After the preceding configuration, only the HiveServer for which lease isolation is configured can be accessed.
- After this function is enabled, you must run the preceding command during login to access the HiveServer for which lease isolation is configured. If you run the beeline command directly to log in to the client, only the HiveServers that are not isolated by the lease are accessed.
- You can log in to FusionInsight Manager, choose System > Permission > Domain and Mutual Trust, and view the value of Local Domain, which is the current system domain name. hive/hadoop.<system domain name> is the username. All letters in the system domain name contained in the username are lowercase letters.
Hive Supports Isolation of Metastore instances Based on Components
Scenario
This function restricts components in a cluster to connect to specified Hive Metastore instances. By default, components can connect to all Metastore instances.
Currently, only HetuEngine, Hive, Loader, Spark, and Flink can connect to Metastore in a cluster. The Metastore instances can be allocated in a unified manner.
- This function only limits the Metastore instances accessed by component servers. Metadata is not isolated.
- Currently, Flink tasks can only connect to Metastore instances through the client.
- When spark-sql is used to execute tasks, the client is directly connected to Metastore. The client needs to be updated for the isolation to take effect.
- This function supports only isolation in the same cluster. If HetuEngine is deployed in different clusters, unified isolation configuration is not supported. You need to modify the HetuEngine configuration to connect to the specified Metastore instance.
- You are advised to configure at least two Metastore instances for each component to ensure availability during isolation configuration.
Prerequisites
The Hive service has been installed in the cluster and is running properly.
Procedure
- Log in to FusionInsight Manager and choose Cluster > Services > Hive. On the displayed page, click the Configurations tab and then All Configurations, and search for the HIVE_METASTORE_URI parameter.
- Set the value of HIVE_METASTORE_URI_DEFAULT to the URI connection string of all Metastore instances.
- Connect a component to a specified Metastore instance. Copy the value in Step 2, modify the configuration items based on the component name, save the modification, and restart the component. The following example shows how Spark connects only to two Metastore instances of Hive.
- Log in to FusionInsight Manager and choose Cluster > Services > Hive. On the displayed page, click the Configurations tab and then All Configurations, and search for the HIVE_METASTORE_URI parameter.
- Copy the default configuration of HIVE_METASTORE_URI_DEFAULT to the URI configuration item of Spark. If Spark needs to connect only to two Metastore instances, retain two nodes as required. Click Save.
- Choose Cluster > Services > Spark. Click Instance, select the instances whose configuration has expired, click More, and select Restart Instance. In the dialog box that is displayed, enter the password and click OK to restart the instances.
Switching the Hive Execution Engine to Tez
Scenario
Hive can use the Tez engine to process data computing tasks. Before executing a task, you can manually switch the execution engine to Tez.
Prerequisites
The TimelineServer role of the Yarn service has been installed in the cluster and is running properly.
Switching the Execution Engine on the Client to Tez
- Install and log in to the Hive client. For details, see Using a Hive Client.
- Run the following command to switch the engine:
set hive.execution.engine=tez;
- To specify a running Yarn queue, run the set tez.queue.name=default command on the client.
- Submit and execute the Tez tasks.
- Log in to FusionInsight Manager. Choose Cluster > Name of the desired cluster > Services > Tez > TezUI(host name) to view the task execution status on the TezUI page.
Switching the Default Execution Engine of Hive to Tez
- Log in to FusionInsight Manager. Choose Cluster > Name of the desired cluster > Services > Hive > Configurations > All Configurations > HiveServer(Role), and search for hive.execution.engine.
- Set hive.execution.engine to tez.
- Click Save. In the displayed confirmation dialog box, click OK.
- Choose Dashboard > More > Restart Service to restart the Hive service. Enter the password to restart the service.
- Install and log in to the Hive client. For details, see Using a Hive Client.
- Submit and execute the Tez tasks.
- Log in to FusionInsight Manager and choose Cluster > Name of the desired cluster > Services > Tez > TezUI(host name). On the displayed TezUI page, view the task execution status.
Interconnecting Hive with External Self-Built Relational Databases
- This section describes how to interconnect Hive with external self-built relational databases (open-source MySQL and Postgres).
- After an external metadata database is deployed in a cluster with Hive data, the original metadata table will not be automatically synchronized. Therefore, before installing Hive, you need to check whether the metadata is stored in an external relational database or built-in DBService. If the metadata is stored in an external RDS, you need to configure the metadata to the external relational database when installing Hive or when there is no Hive data. After the installation, the metadata cannot be modified. Otherwise, the original metadata will be lost.
- After external metadata is imported to the MySQL database, Chinese characters are supported only in Hive table names, field names, and table descriptions.
Hive supports access to open source MySQL and Postgres metabases.
- Install the open source MySQL or Postgres database.
The node where the database is installed must be in the same network segment as the cluster, so that they can access each other.
- Upload the driver package.
- Postgres: Use the open source Postgres driver package to replace the existing one of the cluster. Upload the Postgres driver package postgresql-42.2.5.jar to the ${BIGDATA_HOME}/third_lib/Hive directory on all MetaStore instance nodes. To download the open-source driver package, visit https://repo1.maven.org/maven2/org/postgresql/postgresql/42.2.5/.
- MySQL: Visit the MySQL official website at https://www.mysql.com/, choose DOWNLOADS > MySQL Community(GPL) DownLoads > Connector/J, and download the driver package of the required version.
Upload the MySQL driver package of the required version to the ${BIGDATA_HOME}/third_lib/Hive directory on all Metastore nodes.
- Create a user and metadata database in the self-built database and assign all permissions on the database to the user.
For example, run the following commands in MySQL to create the database hivemeta and the user testuser, and assign all permissions on hivemeta to testuser:
create database hivemeta;
create user 'testuser'@'%' identified by 'password';
grant all privileges on hivemeta.* to 'testuser';
flush privileges;
- Import the SQL statements for creating metadata tables.
- Path of the PostgreSQL file: ${BIGDATA_HOME}/FusionInsight_HD_8.3.1/install/FusionInsight-Hive-3.1.0/hive-3.1.0/scripts/metastore/upgrade/postgres/hive-schema-3.1.0.postgres.sql
Run the following command to import the SQL file to Postgres:
./bin/psql -U username -d databasename -f hive-schema-3.1.0.postgres.sql
Specifically:
./bin/psql is in the Postgres installation directory.
username indicates the username for logging in to Postgres.
databasename indicates the database name.
- Path of the MySQL file: ${BIGDATA_HOME}/FusionInsight_HD_8.3.1/install/FusionInsight-Hive-3.1.0/hive-3.1.0/scripts/metastore/upgrade/mysql/hive-schema-3.1.0.mysql.sql
Run the following command to import the SQL file to the MySQL database:
./bin/mysql -u username -ppassword -D databasename < hive-schema-3.1.0.mysql.sql
Specifically:
./bin/mysql is in the MySQL installation directory.
username indicates the user name for logging in to MySQL.
databasename indicates the database name.
- To connect Hive to the open-source database, log in to FusionInsight Manager. Choose Cluster > Name of the desired cluster > Services > Hive > Configurations > All Configurations > Hive(Service) > MetaDB, modify the following parameters, and click Save.

Table 2-810 Parameter description

Parameter | Default Value | Description
---|---|---
javax.jdo.option.ConnectionDriverName | org.postgresql.Driver | Driver class for connecting to the metadata database on MetaStore. If an external MySQL database is used, the value is com.mysql.jdbc.Driver. If an external Postgres database is used, the value is org.postgresql.Driver.
javax.jdo.option.ConnectionURL | jdbc:postgresql://%{DBSERVICE_FLOAT_IP}%{DBServer}:%{DBSERVICE_CPORT}/hivemeta?socketTimeout=60 | URL of the JDBC link of the MetaStore metadata. If an external MySQL database is used, the value is jdbc:mysql://MySQL IP address:MySQL port number/test?characterEncoding=utf-8. If an external Postgres database is used, the value is jdbc:postgresql://Postgres IP address:Postgres port number/test.
javax.jdo.option.ConnectionUserName | hive${SERVICE_INDEX}${SERVICE_INDEX} | Username for connecting to the metadata database on MetaStore. If an external MySQL database is used, the value is:
- Change the Postgres database password in MetaStore. Choose Cluster > Name of the desired cluster > Services > Hive > Configurations > All Configurations > MetaStore(Role) > MetaDB, modify the following parameters, and click Save.

Table 2-811 Parameter description

Parameter | Default Value | Description
---|---|---
javax.jdo.option.extend.ConnectionPassword | ****** | User password for connecting to the external metadata database on MetaStore. The password is encrypted in the background.
- Log in to each MetaStore background node and check whether the local directory /opt/Bigdata/tmp exists.
- If yes, go to Step 8.
- If no, run the following commands to create one:
mkdir -p /opt/Bigdata/tmp
chmod 755 /opt/Bigdata/tmp
- Save the configuration. Choose Dashboard > More > Restart Service, and enter the password to restart the Hive service.
Redis-based CacheStore of HiveMetaStore
Scenario
The MetaStore service of Hive can cache the metadata of some tables in Redis.
- The same external RDS and Redis cache cluster can be configured for multiple Hive clusters to implement metadata sharing among multiple clusters. In multi-cluster scenarios, only Redis clusters in normal mode can be configured.
- If the Redis cache is abnormal, MetaStore automatically delivers requests to the metabase, which does not affect SQL execution.
Prerequisites
The Redis service has been installed in a cluster.
Configure Parameters Related to Metastore
- Log in to FusionInsight Manager and choose Cluster > Services > Hive. Click Configurations, search for hive.metastore.rawstore.impl, and change its value to org.apache.hadoop.hive.metastore.cache.redis.RedisCachedStore (Redis cache implementation class) to enable the Redis cache function of MetaStore.
- Click All Configurations, click MetaStore(Role), select Custom, and modify the following parameters to connect the MetaStore cache to the Redis service.

Table 2-812 Parameters

Parameter | Value | Description
---|---|---
redis.cluster.host.and.port | xxx.xxx.xxx.xxx:22400;xxx.xxx.xxx.xxx:22401 | Mandatory. Service IP address and port number of any node in the Redis cluster, in the format of IP address:Port number;IP address:Port number. The value cannot end with a semicolon (;).
metastore.cached.rawstore.cached.object.whitelist | catalog.database.table,catalog.database.table | Optional. List of tables to be cached. Tables in the list are cached to Redis. You are advised to list tables that contain over 100,000 partitions. Multiple tables are separated by commas (,). The default value is .*, that is, all tables are cached. NOTE: A table name consists of catalog.database.table, and the default catalog is hive. If there are excessive tables, the default value .* may prevent all data from being uploaded to Redis within the prewarm period.
redis.cache.prewarm.cron | cron expression, for example, 0 0 16 * * ? | Optional. Cron expression for periodically executing prewarm, which synchronizes the data in the metadata database to the Redis cache. NOTE: You are advised to configure the expression to run once a day during off-peak hours. Partitioned tables are synchronized only when the number of partitions in the metadata database is inconsistent with that in the Redis cache.
metastore.cached.rawstore.catalogs | hive | Optional. Catalog to be cached. The default value is hive.
jedis.pool.max.wait.mills | 30,000 | Optional. Timeout interval for obtaining a Redis connection, in ms. In security mode, the timeout interval can be longer. The default value is 30,000 ms.
jedis.pool.max.idle | 200 | Optional. Maximum number of idle connections in the Jedis connection pool. You are advised to set this parameter to the value of jedis.pool.max.total. The default value is 200.
jedis.pool.max.total | 200 | Optional. Maximum number of connections in the Jedis connection pool. The default value is 200.
redis.security.enabled | true or false | Optional. Whether to enable the Redis cache security mode. The default value is true, indicating that the security mode is enabled.
- Save the configuration and choose Dashboard > More > Restart Service to restart the Hive service.
Precautions
If MetaStore is switched back to the native non-cache mode and then switched to the Redis cache mode again after a period of time, the metadata added, deleted, or modified while the database was used directly cannot be synchronized to Redis. Therefore, before switching back, you must clear the cached tables in Redis and synchronize the metadata again in either of the following ways:
- Log in to the Redis client and run the flushall command on all Redis nodes.
- Log in to the Redis client and run the following commands to change the two Redis identifiers. {hiveServiceName} is the value of HIVE_DEFAULT_GROUP in the Metastore configuration file ENV_VARS. The default value is hive.
set {hiveServiceName}-hive-isRedisAvailable false
del {hiveServiceName}-hive-isCanPrewarm
Hive Supporting Reading Hudi Tables
Hive External Tables Corresponding to Hudi Tables
A Hudi source table corresponds to a copy of HDFS data. The Hudi table data can be mapped to a Hive external table through the Spark component, Flink component, or Hudi client. Based on the external table, Hive can easily perform real-time query, read-optimized view query, and incremental view query.
- Different view queries are provided for different types of Hudi source tables:
- When the Hudi source table is a Copy-On-Write (COW) table, it can be mapped to a Hive external table. The table supports real-time query and incremental view query.
- When the Hudi source table is a Merge-On-Read (MOR) table, it can be mapped to two Hive external tables (RO table and RT table). The RO table supports read-optimized view query, and the RT table supports real-time view query and incremental view query.
- Hive external tables cannot be added, deleted, or modified (including insert, update, delete, load, merge, alter and msck). Only the query operation (select) is supported.
- Granting table permissions: The update, alter, write, and all permissions cannot be modified.
- Backup and restoration: The RO and RT tables are mapped from the same Hudi source table. When one table is backed up, the other table is also backed up. The same applies to restoration. Therefore, only one table needs to be backed up.
- Component versions:
- Hive: FusionInsight_HD_8.3.1; Hive kernel version 3.1.0
- Spark: FusionInsight_Spark_8.3.1; Hudi kernel version 0.11.0
Creating Hive External Tables Corresponding to Hudi Tables
Generally, Hudi table data is synchronized to Hive external tables when the data is imported to the lake. In this case, you can directly query the corresponding Hive external tables in Beeline. If the data is not synchronized to the Hive external tables, you can use the Hudi client tool run_hive_sync_tool.sh to synchronize data manually.
Querying Hive External Tables Corresponding to Hudi Tables
Prerequisites
Before using Hive to perform incremental query on Hudi tables, you need to set another three parameters in Table 2-813. The three parameters are table-level parameters. Each Hudi source table corresponds to three parameters, where hudisourcetablename indicates the name of the Hudi source table (not the name of the Hive external table).
Table 2-813 Parameter description

Parameter | Default Value | Description
---|---|---
hoodie.hudisourcetablename.consume.mode | None | Query mode of the Hudi table.
hoodie.hudisourcetablename.consume.start.timestamp | None | Start time of the incremental query on the Hudi table.
hoodie.hudisourcetablename.consume.max.commits | None | Number of commits, counted after hoodie.hudisourcetablename.consume.start.timestamp, on which the incremental query on the Hudi table is based.
Querying a Hudi COW Table
For example, the name of a Hudi source table of the COW type is hudicow, and the name of the mapped Hive external table is hudicow.
- Real-time view query on the COW table:
select * from hudicow;
- Incremental query on the COW table: Set three incremental query parameters based on the name of the Hudi source table. The where clause of the incremental query statements must contain `_hoodie_commit_time`>'xxx', where xxx indicates the value of hoodie.hudisourcetablename.consume.start.timestamp.
set hoodie.hudicow.consume.mode= INCREMENTAL;
set hoodie.hudicow.consume.max.commits=3;
set hoodie.hudicow.consume.start.timestamp= 20200427114546;
select count(*) from hudicow where `_hoodie_commit_time`>'20200427114546';
Querying a Hudi MOR Table
For example, the name of a Hudi source table of the MOR type is hudimor, and the two mapped Hive external tables are hudimor_ro (RO table) and hudimor_rt (RT table).
- Read-optimized view query on the RO table:
select * from hudimor_ro;
- Real-time view query on the RT table:
select * from hudimor_rt;
- Incremental query on the RT table: Set three incremental query parameters based on the name of the Hudi source table. The where clause of the incremental query statements must contain `_hoodie_commit_time`>'xxx', where xxx indicates the value of hoodie.hudisourcetablename.consume.start.timestamp.
set hoodie.hudimor.consume.mode=INCREMENTAL;
set hoodie.hudimor.consume.max.commits=-1;
set hoodie.hudimor.consume.start.timestamp=20210207144611;
select * from hudimor_rt where `_hoodie_commit_time`>'20210207144611';
set hoodie.hudisourcetablename.consume.mode=INCREMENTAL; is used only for the incremental query on the table. To switch to another query mode, run set hoodie.hudisourcetablename.consume.mode=SNAPSHOT;.
Querying Hive External Tables Corresponding to Hudi Schema Evolution Tables
If the Hudi table is a schema evolution table (some fields in the table have been modified), you need to add set hive.exec.schema.evolution=true; when Hive queries the table.
The following uses the query of the real-time view of a COW table as an example. To query other views, you need to add this parameter.
- Real-time view query on the COW table:
set hive.exec.schema.evolution=true;
select * from hudicow;
Hive Supporting Cold and Hot Storage of Partitioned Metadata
Cold and Hot Storage of Partitioned Metadata
- Metadata that has not been used for a long time is moved to a backup table to reduce the pressure on the metadata database. This process is called partitioned data freezing. The partitions whose metadata is moved are cold partitions, and partitions that are not frozen are hot partitions. A table with cold partitions is a frozen table. Moving the frozen data back to the original metadata table is called partitioned data unfreezing.
- When a partition is changed from a hot partition to a cold partition, only metadata is identified. The partition path and data file content on the HDFS service side do not change.
Freezing a Partition
The user who creates the table can freeze one or more partitions based on filter criteria. The format is freeze partitions Database name.Table name where Filter criteria.
Example:
freeze partitions testdb.test where year <= 2021;
freeze partitions testdb.test where year<=2021 and month <= 5;
freeze partitions testdb.test where year<=2021 and month <= 5 and day <= 27;
Unfreezing a Partition
The user who creates the table can unfreeze one or more partitions based on filter criteria. The format is unfreeze partitions Database name.Table name where Filter criteria. Example:
unfreeze partitions testdb.test where year <= 2021;
unfreeze partitions testdb.test where year<=2021 and month <= 5;
unfreeze partitions testdb.test where year<=2021 and month <= 5 and day <= 27;
Querying Tables with Frozen Data
- Querying all frozen tables in the current database
show frozen tables;
- Querying all frozen tables in the dbname database
show frozen tables in dbname;
Querying Frozen Partitions of a Frozen Table
Querying frozen partitions
show frozen partitions table;
- By default, only partitions of the int, string, varchar, date, or timestamp type can be frozen in the metadata database.
- For external metadata databases, only the Postgres database is supported, and only partitions of the int, string, varchar, or timestamp type can be frozen.
- You need to unfreeze data before using MSCK to restore the metadata of a frozen table. If a frozen table has been backed up and restored, you can only run the msck repair command to restore the table, and the table can be unfrozen only after this command is run.
- You need to unfreeze data before renaming a frozen partition. Otherwise, a message indicating that the partition does not exist is displayed.
- When a table that contains frozen data is deleted, the frozen data is also deleted.
- When a partition that contains frozen data is deleted, information about the frozen partition and HDFS service data is not deleted.
- When you run the select command to query data, the criteria for filtering the data in cold partitions is automatically added. The query result does not contain the data in cold partitions.
- When you run the show partitions table command to query the partitioned data in the table, the query result does not contain the data in cold partitions. You can run the show frozen partitions table command to query frozen partitions.
- If Redis-based CacheStore is enabled, cold and hot storage of metadata is not supported.
Hive Supporting ZSTD Compression Formats
Zstandard (ZSTD) is an open-source lossless data compression algorithm. Its compression performance and compression ratio are better than those of other compression algorithms supported by Hadoop. Hive with this feature supports tables in ZSTD compression formats. The ZSTD compression formats supported by Hive include ORC, RCFile, TextFile, JsonFile, Parquet, Sequence, and CSV.
You can create a table in ZSTD compression format as follows:
- To create a table in ORC format, specify TBLPROPERTIES("orc.compress"="zstd").
create table tab_1(...) stored as orc TBLPROPERTIES("orc.compress"="zstd");
- To create a table in Parquet format, specify TBLPROPERTIES("parquet.compression"="zstd").
create table tab_2(...) stored as parquet TBLPROPERTIES("parquet.compression"="zstd");
- To create a table in another format or a common format, run the following commands to set the compression codec parameters to org.apache.hadoop.io.compress.ZStandardCodec:
set hive.exec.compress.output=true;
set mapreduce.map.output.compress=true;
set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.ZStandardCodec;
set mapreduce.output.fileoutputformat.compress=true;
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.ZStandardCodec;
set hive.exec.compress.intermediate=true;
create table tab_3(...) stored as textfile;
- The SQL operations on a table compressed using ZSTD are the same as those on a common compressed table. Addition, deletion, query, and aggregation are supported.
- To default the compression format of Parquet tables to ZSTD, run the following command on the Hive Beeline client:
set hive.parquet.default.compression.codec=zstd;
This command is applied to the current session only.
Hive Partitioned Tables Support OBS and HDFS Storage Sources
Scenario
In the scenario where storage and compute resources are separated, you can specify different storage sources, for example, OBS or HDFS, for partitions in a Hive partitioned table.
Prerequisites
The Hive client has been installed. For details, see Installing a Client.
Example
- Log in to the node where the Hive client is installed as the Hive client installation user.
- Run the following command to go to the client installation directory:
cd Client installation directory
For example, if the client installation directory is /opt/hadoopclient, run the following command:
cd /opt/hadoopclient
- Run the following command to configure environment variables:
source bigdata_env
- Check whether the cluster authentication mode is in security mode.
- If yes, run the following command to authenticate the user:
kinit Hive service user
- If no, go to 5.
- Run the following command to log in to the Hive client:
beeline
- Run the following commands to create a Hive partitioned table named table_1 and set the paths of partitions pt='2021-12-12' and pt='2021-12-18' to hdfs://xxx and obs://xxx respectively:
create table table_1(id string) partitioned by(pt string) [stored as [orc|textfile|parquet|...]];
alter table table_1 add partition(pt='2021-12-12') location 'hdfs://xxx';
alter table table_1 add partition(pt='2021-12-18') location 'obs://xxx';
- After data is inserted into table_1, it is stored in the corresponding storage source. You can run the desc command to view the location of each partition.
desc formatted table_1 partition(pt='2021-12-18');
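For example, the following statement inserts a record into the partition stored on OBS; the inserted value is an assumption for illustration:
insert into table_1 partition(pt='2021-12-18') values('id_001');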
Locating Abnormal Hive Files
Scenario
- Data files stored in Hive may be abnormal due to misoperations or disk damage, causing task execution failures or incorrect data results.
- You can use the tool described in this section to locate abnormal data files in common non-text formats.
Procedure
- Log in to the node where the Hive service is installed as user omm and run the following command to go to the Hive installation directory:
cd ${BIGDATA_HOME}/FusionInsight_HD_*/install/FusionInsight-Hive-*/hive-*/bin
- Run the following tool to locate abnormal Hive files:
sh hive_parser_file.sh [--help] <filetype> <command> <input-file|input-directory>
Table 2-814 describes the related parameters.
Note: You can run only one command at a time.
Table 2-814 Parameter description

Parameter | Description | Remarks
---|---|---
filetype | Specifies the format of the data file to be parsed. Currently, only the ORC, RC (RCFile), and Parquet formats are supported. | Currently, data files in the RC format can only be viewed.
-c | Prints the column information in the current metadata. | The column information includes the class name, file format, and sequence number.
-d | Prints data in a data file. You can limit the data volume using the limit parameter. | The data is the content of the specified data file. Only one value can be specified for the limit parameter at a time.
-t | Prints the time zone to which the data is written. | The time zone is the zone to which the file is written.
-h | Prints the help information. | Help information.
-m | Prints information about various storage formats. | The information varies based on the storage format. For example, if the file format is ORC, information such as stripe and block size will be printed.
-a | Prints detailed information. | The detailed information, including the preceding parameters, is displayed.
input-file | Specifies the data file to be input. | If the input directory contains a file of a supported format, the file will be parsed. Otherwise, this operation is skipped. You can specify a local file or an HDFS/OBS file or directory.
input-directory | Specifies the directory where the input data files are located. This parameter is used when there are multiple subfiles. | -
- Example:
sh hive_parser_file.sh orc -d limit=100 hdfs://hacluster/user/hive/warehouse/orc_test
If the file name does not contain a prefix similar to hdfs://hacluster, the local file is read by default.
Using the ZSTD_JNI Compression Algorithm to Compress Hive ORC Tables
Scenario
ZSTD_JNI is a native implementation of the ZSTD compression algorithm. Compared with ZSTD, ZSTD_JNI has higher compression read/write efficiency and compression ratio, and allows you to specify the compression level as well as the compression mode for data columns in a specific format.
Currently, only ORC tables can be compressed using ZSTD_JNI, whereas ZSTD can compress tables in all storage formats supported by Hive. Therefore, you are advised to use this feature only when you have high requirements on data compression.
Example
- Log in to the node where the client is installed as the Hive client installation user.
- Run the following command to switch to the client installation directory, for example, /opt/hadoopclient:
cd /opt/hadoopclient
- Run the following command to configure environment variables:
source bigdata_env
- Check whether the cluster is in security mode. If yes, run the kinit Hive service user command to authenticate the user. If no, go to the next step.
- Run the following command to log in to the Hive client:
beeline
- Create a table in ZSTD_JNI compression format as follows:
- Run the following example command to set the orc.compress parameter to ZSTD_JNI when using this compression algorithm to create an ORC table:
create table tab_1(...) stored as orc TBLPROPERTIES("orc.compress"="ZSTD_JNI");
- The compression level of ZSTD_JNI ranges from 1 to 19. A larger value indicates a higher compression ratio but a slower read/write speed; a smaller value indicates a lower compression ratio but a faster read/write speed. The default value is 6. You can set the compression level through the orc.global.compress.level parameter, as shown in the following example.
create table tab_1(...) stored as orc TBLPROPERTIES("orc.compress"="ZSTD_JNI", 'orc.global.compress.level'='3');
- This compression algorithm allows you to compress service data and columns in a specific data format. Currently, data in the following formats is supported: JSON data columns, Base64 data columns, timestamp data columns, and UUID data columns. You can achieve this function by setting the orc.column.compress parameter during table creation.
The following example code shows how to use ZSTD_JNI to compress data in the JSON, Base64, timestamp, and UUID formats.
create table test_orc_zstd_jni(f1 int, f2 string, f3 string, f4 string, f5 string) stored as orc
TBLPROPERTIES('orc.compress'='ZSTD_JNI', 'orc.column.compress'='[{"type":"cjson","columns":"f2"},{"type":"base64","columns":"f3"},{"type":"gorilla","columns":{"format": "yyyy-MM-dd HH:mm:ss.SSS", "columns": "f4"}},{"type":"uuid","columns":"f5"}]');
You can insert data in the corresponding format based on the site requirements to further compress the data.
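For example, the following hedged INSERT statement (the values are illustrative samples only) writes one row whose columns match the JSON, Base64, timestamp, and UUID formats declared for the test_orc_zstd_jni table above:
insert into test_orc_zstd_jni values (1, '{"city":"Shenzhen","pv":120}', 'aGVsbG8gd29ybGQ=', '2024-01-01 12:00:00.000', 'f47ac10b-58cc-4372-a567-0e02b2c3d479');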
HiveMetaStore Access Pressure Reduction
Scenario
When an excessive number of metadata tables or partitions (even tens of millions) are planned for services, a single underlying database is under high read and write pressure. The HiveMetaStore access pressure reduction feature reduces the read/write pressure by creating multiple relational databases to read and write metadata.
Procedure
- Create a secondary relational database in HiveMetaStore. If the secondary relational database is DBService, go to Step 2. If the secondary relational database is MySQL, go to Step 3.
- The type and version of the secondary relational database must be the same as those of the current storage database. Currently, a maximum of five secondary relational databases can be created.
- The secondary relational database can be built-in DBService provided by MRS or external MySQL.
- If the secondary relational database is DBService, perform the following operations:
- Log in to FusionInsight Manager and choose Cluster > Services. On the Services page that is displayed, click Add Service. In the Add Service dialog box, add a DBService, for example, named DBService-1. For details about how to add a service, see "Adding a Service" in MapReduce Service (MRS) 3.3.1-LTS User Guide (for Huawei Cloud Stack 8.3.1) in the MapReduce Service (MRS) 3.3.1-LTS Usage Guide (for Huawei Cloud Stack 8.3.1).
- Copy the ${BIGDATA_HOME}/FusionInsight_HD_*/install/FusionInsight-Hive-3.1.0/hive-3.1.0/scripts/multipalstore-migrationtool/ctlgs_dbs_export_import/postgresql directory on any node where Hive is installed to any temporary directory on the node where the primary database resides as user omm, and go to the directory as this user.
- Configure database parameters in the dbProperties.template file.
EXPORT_DB_PORT=20051 #Port for the primary database to connect to other databases
EXPORT_DB_NAME=hivemeta #Name of the Hive database in the primary database. The default value is hivemeta.
IMPORT_DB_HOST= #IP address of the database connected to the secondary database. Enter the floating IP address if the primary database is a DBService database.
IMPORT_DB_PORT=20051 #Port for the primary database to connect to other databases.
IMPORT_DB_NAME=hivemeta #Name of the Hive database in the secondary database. The default value is hivemeta.
IMPORT_DB_HIVE_USER=testuser #Owner created for the secondary database.
IMPORT_TMP_PATH=/home/omm/testinit #Specified empty path on the node where the secondary database resides. Operations on this path require the read, write, and execute permissions of system user omm. This directory temporarily stores data exported from the primary database and is automatically deleted after the data is imported to the secondary database.
- Run the following command to initialize data. When prompted, enter the password of the IMPORT_DB_HIVE_USER user configured in the dbProperties.template file:
sh ctlgs_dbs_operator.sh dbProperties.template
- If the secondary relational database is a MySQL database, perform the following steps:
- Install and configure the MySQL database by referring to Interconnecting Hive with External Self-Built Relational Databases.
- Initialize the metadata in the default secondary relational database stored in HiveMetaStore and the table mapping information of the primary database. Copy the ${BIGDATA_HOME}/FusionInsight_HD_*/install/FusionInsight-Hive-3.1.0/hive-3.1.0/scripts/multipalstore-migrationtool/ctlgs_dbs_export_import/mysql directory on any node where Hive is installed to any temporary directory on the node where the primary database resides, and go to the directory.
- Configure database parameters in the dbProperties.template file.
EXPORT_DB_USER= #User of the primary database
EXPORT_DB_PORT=3306 #Port for the primary database to connect to other databases
EXPORT_DB_NAME=hivemeta #Name of the Hive database in the primary database. The default value is hivemeta.
IMPORT_DB_USER= #Username of the secondary database
IMPORT_DB_HOST= #IP address of the secondary database
IMPORT_DB_PORT= #Port number of the secondary database
IMPORT_DB_NAME=hivemeta #Name of the Hive database in the secondary database. The default value is hivemeta.
IMPORT_TMP_PATH= #Temporary path of the secondary database. The user who executes the ctlgs_dbs_operator.sh script must have the read, write, and execute permissions on the path.
- Run the following command to initialize data. When prompted, enter the passwords of the EXPORT_DB_USER and IMPORT_DB_USER users configured in the dbProperties.template file:
sh ctlgs_dbs_operator.sh dbProperties.template
- Perform the following steps to partially migrate metadata in the primary database. Otherwise, go to Step 5. Fully evaluate the impact on services before performing this operation and exercise caution when performing it.
- Currently, only the metadata of the built-in DBService database can be migrated.
- The Hive service must be stopped during data migration.
Log in to FusionInsight Manager, choose Cluster > Services > Hive, and click Stop.
- If the metabase is a DBService database, copy the initialization script directory ${BIGDATA_HOME}/FusionInsight_HD_*/install/FusionInsight-Hive-3.1.0/hive-3.1.0/scripts/multipalstore-migrationtool/migration_export_import/ on any node where the Hive service is installed to any temporary directory on the primary database node.
- Set database parameters in the dbProperties.template file.
EXPORT_DB_PORT=20051 #Port for the primary database to connect to other databases
EXPORT_DB_NAME=hivemeta #Name of the Hive database in the primary database. The default value is hivemeta.
IMPORT_DB_HOST= #IP address of the database interconnecting with the secondary database. Enter the floating IP address if the primary database is a DBService database.
IMPORT_DB_PORT=20051 #Port number of the secondary database.
IMPORT_DB_NAME=hivemeta #Name of the Hive database in the secondary database. The default value is hivemeta.
IMPORT_DB_HIVE_USER=testuser #User of the new secondary database.
IMPORT_TMP_PATH=/home/omm/testmigration #Specified empty path on the node where the secondary database resides. Operations on this path require the read, write, and execute permissions of system user omm. This directory temporarily stores data exported from the primary database and is automatically deleted after the data is imported to the secondary database.
IMPORT_RDS_RDB_KEY=javax.jdo.option.ConnectionURL1 #JDBC URL key of the secondary database.
- Execute the queryMaxPartitionTable.sh script to view partition table information in descending order of the number of partitions. For example, to view the 20 tables with the largest number of partitions, run the following command:
sh queryMaxPartitionTable.sh dbProperties.template 20
- Create a text file in the temporary migration directory on the primary database node. The file content is the names of the tables to be migrated from the primary database, in the format Database name.Table name with one name per line. The following is an example:
vim tbl_propertie
default.testtable1
default.testtable2
default.testtable1
testdb1.test1
testdb2.test2
- Run the following command to migrate data. When prompted, enter the password of the IMPORT_DB_HIVE_USER user configured in the dbProperties.template file:
sh migrationtool.sh tbl_propertie dbProperties.template
- After the migrationtool.sh script is successfully executed, run the delete.sh script to delete the migrated data from the primary database.
sh delete.sh tbl_propertie dbProperties.template
- Log in to FusionInsight Manager and choose Cluster > Services > Hive. On the page that is displayed, click the Configurations tab then the All Configurations sub-tab. In the navigation pane on the left, choose MetaStore(Role) and change the value of hive.metastore.rawstore.impl to org.apache.hadoop.hive.metastore.multi.MultiObjectStore.
Choose MetaStore(Role) > MultiMetaDB and modify the following parameters:
Table 2-815 Parameters

Parameter | Description
---|---
javax.jdo.option.ConnectionURLx | JDBC URL of the metadata in MetaStore. If the secondary database is MySQL, the value is in the format jdbc:mysql://MySQL IP address:MySQL port number/test?characterEncoding=utf-8. If the secondary database is DBService or PostgreSQL, the value is in the format jdbc:postgresql://PostgreSQL IP address:PostgreSQL port number/hivemeta.
javax.jdo.option.ConnectionUserNamex | Username for connecting to the secondary relational database on MetaStore.
javax.jdo.option.ConnectionPasswordx | Password for connecting to the secondary relational database on MetaStore. NOTE: The system automatically encrypts user passwords.

Each secondary relational database has three configuration items: JDBC URL, username, and password. The numbers at the end of these three parameters of a secondary relational database are the same (1, 2, 3, 4, or 5). That is, a maximum of five secondary relational databases can be created.
For example, if one secondary relational database is created, you need to configure javax.jdo.option.ConnectionURL1, javax.jdo.option.ConnectionUserName1, and javax.jdo.option.ConnectionPassword1.
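As a hedged illustration (the IP address, port, database name, and username below are placeholders, not values from this guide), the three items for the first secondary MySQL database could look as follows:
javax.jdo.option.ConnectionURL1 = jdbc:mysql://192.168.20.30:3306/hivemeta?characterEncoding=utf-8
javax.jdo.option.ConnectionUserName1 = hive_meta_user
javax.jdo.option.ConnectionPassword1 = Password of the hive_meta_user user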
- (Optional) Add the metastore-ext.write.rdb.custom.address parameter so that new tables can be written in the specified RDB.
Choose Cluster > Services > Hive, click Configurations and then All Configurations, click MetaStore(Role), select Customization, and add the following configuration to the hive.metastore.customized.configs parameter: Name: metastore-ext.write.rdb.custom.address; Value: the JDBC URL of the MetaStore metadata, that is, the value of javax.jdo.option.ConnectionURLx. Separate multiple values with commas (,) or semicolons (;).
For example, if the parameter is set to javax.jdo.option.ConnectionURL1,javax.jdo.option.ConnectionURL2, the metadata of new tables is written to the RDBs specified by javax.jdo.option.ConnectionURL1 and javax.jdo.option.ConnectionURL2.
- Click Save. In the dialog box that is displayed, click OK to save the configuration.
- Click the Dashboard tab. On this tab page, choose More > Restart Service, and enter the password to restart the Hive service.
Load Balancing for Hive MetaStore Client Connection
Scenario
Hive MetaStore client connections support load balancing. That is, during heavy service traffic, clients connect to the MetaStore node with the fewest connections based on the connection counts recorded in ZooKeeper, which prevents a single MetaStore node from being overloaded. Enabling this function does not affect the original connection mode.
Procedure
- Log in to FusionInsight Manager, click Cluster, choose Services > Hive, click Configurations, and then All Configurations.
- Search for the hive.metastore-ext.balance.connection.enable parameter and set its value to true.
- Click Save.
- Click Instance, select all instances, choose More > Restart Instance, enter the password, and click OK to restart all Hive instances.
- For other components that connect to MetaStore, add the hive.metastore-ext.balance.connection.enable parameter and set its value to true.
The following uses Spark as an example:
- Log in to FusionInsight Manager, choose Cluster > Services > Spark, and click Configurations.
- Click Customization, add a custom parameter hive.metastore-ext.balance.connection.enable to all hive-site.xml parameter files, set its value to true, and click Save.
- Click Instance, select all configuration-expired instances, choose More > Restart Instance, enter the password, and click OK to restart them.
In the Hive multi-service or multi-instance scenario, add the custom configuration parameter metastore-ext.balance.zookeeper.namespace to the parameter files (such as hive-site.xml and hdfs-site.xml) of other components that connect to MetaStore. Set the parameter to metastore for Hive, to metastore-1 for Hive-1, and so on. The value is the ZooKeeper directory of the MetaStore to be connected. You can log in to the ZooKeeper client and run the ls / command to check the directory.
For example:
- To enable this function, add custom parameter metastore-ext.balance.zookeeper.namespace and set it to metastore-1 if Spark uses Hive-1.
- To enable this function, add custom parameter metastore-ext.balance.zookeeper.namespace and set it to metastore-2 if Spark uses Hive-2.
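A minimal sketch of checking the available MetaStore namespaces on the ZooKeeper client (the client installation path, service address, and port below are assumptions):
cd /opt/hadoopclient/zookeeper/bin
./zkCli.sh -server ZooKeeper service IP address:ZooKeeper client port
ls /
# The output is expected to contain nodes such as metastore and metastore-1 when multiple Hive services exist.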
Data Import and Export in Hive
Importing and Exporting Table/Partition Data in Hive
Scenario
In big data application scenarios, data tables in Hive usually need to be migrated to another cluster. You can run the Hive import and export commands to migrate table data. That is, you run the export command in the source cluster to export a Hive table to the HDFS of the target cluster, and then run the import command in the target cluster to import the exported data into the corresponding Hive table.
The Hive table import and export function does not support encrypted tables, HBase external tables, Hudi tables, view tables, or materialized view tables.
Prerequisites
- If Hive tables or partition data is imported or exported across clusters and Kerberos authentication is enabled for both the source and destination clusters, configure cross-cluster mutual trust. For details, see Configuring Cross-Manager Mutual Trust Between Clusters.
- If you want to run the import or export command to import or export tables or partitions created by other users, grant the corresponding table permission to the users.
- If Ranger authentication is not enabled for the cluster, log in to FusionInsight Manager to grant the Select Authorization permission of the table corresponding to the role to which the user belongs. For details, see section Configuring Permissions for Hive Tables, Columns, or Databases.
- If Ranger authentication is enabled for the cluster, grant users the permission to import and export tables. For details, see Adding a Ranger Access Permission Policy for Hive.
- Enable the inter-cluster copy function in the source cluster and destination cluster. For details, see Enabling Cross-Cluster Replication.
- Configure the HDFS service address parameter for the source cluster to access the destination cluster.
Log in to FusionInsight Manager of the source cluster, click Cluster, choose Services > Hive, and click Configuration. On the displayed page, search for hdfs.site.customized.configs, add custom parameter dfs.namenode.rpc-address.haclusterX, and set its value to Service IP address of the active NameNode instance node in the destination cluster:RPC port. Add custom parameter dfs.namenode.rpc-address.haclusterX1 and set its value to Service IP address of the standby NameNode instance node in the destination cluster:RPC port. The RPC port of NameNode is 25000 by default. After saving the configuration, roll-restart the Hive service.
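For example (the IP addresses below are placeholders for the destination cluster NameNodes; 25000 is the default NameNode RPC port), the two custom parameters could be set as follows:
dfs.namenode.rpc-address.haclusterX = 192.168.10.11:25000
dfs.namenode.rpc-address.haclusterX1 = 192.168.10.12:25000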
Procedure
- Log in to the node where the client is installed in the source cluster as the Hive client installation user.
- Run the following command to switch to the client installation directory, for example, /opt/hadoopclient:
cd /opt/hadoopclient
- Run the following command to configure environment variables:
source bigdata_env
- If Kerberos authentication is enabled for the cluster, run the following command to authenticate the user. Otherwise, skip this step.
kinit Hive service user
- Run the following command to log in to the Hive client in the destination cluster:
beeline
- Run the following command to create the export_test table:
create table export_test(id int);
- Run the following command to insert data to the export_test table:
insert into export_test values(123);
- Repeat Step 1 to Step 4 in the destination cluster and run the following command to create an HDFS path for storing the exported export_test table:
dfs -mkdir /tmp/export
- Run the following command to log in to the Hive client:
beeline
- Import and export the export_test table.
The Hive import and export commands can be used to migrate table data in the following modes. Select a proper data migration mode as required.
- Mode 1: Table export and import
- Run the following command in the source cluster to export the metadata and service data of the export_test table to the directory created in Step 8:
export table export_test to 'hdfs://haclusterX/tmp/export';
- Run the following command in the destination cluster to import the table data exported in 10.a to the export_test table:
import from '/tmp/export';
- Mode 2: Renaming a table during the import
- Run the following command in the source cluster to export the metadata and service data of the export_test table to the directory created in Step 8:
export table export_test to 'hdfs://haclusterX/tmp/export';
- Run the following command in the destination cluster to import the table data exported in 10.a to the import_test table:
import table import_test from '/tmp/export';
- Mode 3: Partition export and import
- Run the following commands in the source cluster to export the pt1 and pt2 partitions of the export_test table to the directory created in Step 8:
export table export_test partition (pt1="in", pt2="ka") to 'hdfs://haclusterX/tmp/export';
- Run the following command in the destination cluster to import the table data exported in 10.a to the export_test table:
import from '/tmp/export';
- Mode 4: Importing table data into a partition
- Run the following command in the source cluster to export the metadata and service data of the export_test table to the directory created in Step 8:
export table export_test to 'hdfs://haclusterX/tmp/export';
- Run the following command in the destination cluster to import the table data exported in 10.a to the pt1 and pt2 partitions of the import_test table:
import table import_test partition (pt1="us", pt2="tn") from '/tmp/export';
- Mode 5: Specifying the table location during the import
- Run the following command in the source cluster to export the metadata and service data of the export_test table to the directory created in Step 8:
export table export_test to 'hdfs://haclusterX/tmp/export';
- Run the following command in the destination cluster to import the table data exported in 10.a to the import_test table and specify its location as tmp/export:
import table import_test from '/tmp' location '/tmp/export';
- Mode 6: Importing data into an external table
- Run the following command in the source cluster to export the metadata and service data of the export_test table to the directory created in Step 8:
export table export_test to 'hdfs://haclusterX/tmp/export';
- Run the following command in the destination cluster to import the table data exported in 10.a to external table import_test:
import external table import_test from '/tmp/export';
Before exporting table or partition data, ensure that the HDFS path for storage has been created and is empty. Otherwise, the export fails.
When partitions are exported or imported, the exported or imported table must be a partitioned table, and data of multiple partition values of the same partition field cannot be exported.
During the data import:
- If the import from '/tmp/export'; statement is used (no table name is specified), the imported data is saved to the table path with the same name as the source table. Pay attention to the following points:
  - If the destination cluster does not have a table with the same name as the source table, such a table is created during the import.
  - Otherwise, the HDFS directory of the table must be empty, or the import fails.
- If the import external table import_test from '/tmp/export'; statement is used, the exported table is imported to the specified table. Pay attention to the following points:
  - If no table with the same name as the specified table exists in the destination cluster, such a table is created during the import.
  - Otherwise, the HDFS directory of the table must be empty, or the import fails.
haclusterX is the value of haclusterX in the new custom parameter dfs.namenode.rpc-address.haclusterX.
Importing and Exporting Hive Databases
Scenario
In big data application scenarios, Hive databases and all tables in these databases are usually migrated to another cluster. You can run the Hive database export and import commands to migrate a complete database.
The Hive database import and export function does not support encrypted tables, HBase external tables, Hudi tables, view tables, or materialized view tables.
Prerequisites
- If Hive databases are imported or exported across clusters and Kerberos authentication is enabled for both the source and destination clusters, configure cross-cluster mutual trust. For details, see Configuring Cross-Manager Mutual Trust Between Clusters.
- If you want to run the dump or load command to import or export databases created by other users, grant the corresponding database permission to the users.
- If Ranger authentication is not enabled for the cluster, log in to FusionInsight Manager to grant the administrator rights of the role to which the user belongs. For details, see section Creating a Hive Role.
- If Ranger authentication is enabled for the cluster, grant users the permission to dump and load databases. For details, see Adding a Ranger Access Permission Policy for Hive.
- Enable the inter-cluster copy function in the source cluster and destination cluster. For details, see Enabling Cross-Cluster Replication.
- Configure the HDFS service address parameter for the source cluster to access the destination cluster.
Log in to FusionInsight Manager of the source cluster, choose Cluster > Services > Hive, and click Configurations. On the page that is displayed, search for hdfs.site.customized.configs, add custom parameter dfs.namenode.rpc-address.haclusterX, and set it to Service IP address of the active NameNode node in the target cluster:RPC port. Add custom parameter dfs.namenode.rpc-address.haclusterX1 and set it to Service IP address of the standby NameNode node in the target cluster:RPC port. The RPC port of NameNode is 25000 by default. After saving the configuration, roll-restart the Hive service.
Procedure
- Log in to the node where the client is installed in the source cluster as the Hive client installation user.
- Run the following command to switch to the client installation directory, for example, /opt/hadoopclient:
cd /opt/hadoopclient
- Run the following command to configure environment variables:
source bigdata_env
- If Kerberos authentication is enabled for the cluster, run the following command to authenticate the user. Otherwise, skip this step.
kinit Hive service user
- Run the following command to log in to the Hive client:
beeline
- Run the following command to create the dump_db database:
create database dump_db;
- Run the following command to switch to the dump_db database:
use dump_db;
- Run the following command to create the test table in the dump_db database:
create table test(id int);
- Run the following command to insert data to the test table:
insert into test values(123);
- Run the following command to set the dump_db database as the source of the replication policy:
alter database dump_db set dbproperties ('repl.source.for'='replpolicy1');
- You must have related permissions on the database when running the alter command to modify database attributes. To configure permissions, perform the following operations:
- If Ranger authentication is not enabled for the cluster, log in to FusionInsight Manager to grant the administrator rights of the role to which the user belongs. For details, see section Creating a Hive Role.
- If Ranger authentication is enabled for the cluster, grant users the permission to dump and load databases. For details, see Adding a Ranger Access Permission Policy for Hive.
- Databases with replication policy sources configured can be deleted only after their replication policy sources are set to null. To do so, run the following command:
alter database dump_db set dbproperties ('repl.source.for'='');
- Run the following command to export the dump_db database to the /user/hive/test directory of the destination cluster:
repl dump dump_db with ('hive.repl.rootdir'='hdfs://haclusterX/user/hive/test');
The command output contains a dump_dir column, which indicates the HDFS directory to which the database data is exported.
- haclusterX is the value of haclusterX in the new custom parameter dfs.namenode.rpc-address.haclusterX.
- Ensure that the current user has the read and write permissions on the export directory to be specified.
- Log in to the node where the client is installed in the destination cluster as the Hive client installation user, and perform Step 2 to Step 5.
- Run the following command to import data from the dump_db database in the /user/hive/test directory to the load_db database:
repl load load_db from 'hdfs://haclusterX/user/hive/test/XXX';
- Replace hdfs://haclusterX/user/hive/test/XXX with the specific path in the dump_dir column in Step 11.
- When the repl load command is used to import a database, pay attention to the following points when specifying the database name:
- If the specified database does not exist, the database will be created during the import.
- If the specified database exists and the value of hive.repl.ckpt.key of the database is the same as the imported path, skip the import operation.
- If the specified database already exists and no table or function exists in this database, only the tables in the source database are imported to the current database during the import. Otherwise, the import fails.
Enabling Hive Interception of Concurrent Overwrite
Scenario
INSERT OVERWRITE of table data is available in Hive. Historical data of tables or partitions is cleared before final data is written. If you perform INSERT OVERWRITE concurrently on the same table or partition, two tasks may delete the same historical data at the same time, or some data files written by a task are deleted by other tasks. This causes data disorder.
To avoid this problem, you can enable interception of concurrent INSERT OVERWRITE.
When enabled, the interception function accesses ZooKeeper, which lowers query performance. If you neither perform concurrent INSERT OVERWRITE on the same tables or partitions nor use ALTER TABLE CONCATENATE for small file merging, you do not need to enable this function.
Procedure
- Log in to FusionInsight Manager and choose Cluster > Services > Hive > Configurations.
- Search for hive-ext.concurrency.overwrite.reject and set its value to true.
- Click Save, and then click OK to save the configuration.
- Click Instance, select all instances, choose More > Restart Instance, enter the password, and click OK to restart all Hive instances.
After this function is enabled, the ZooKeeper service is used by default to record information about tables and partitions where INSERT OVERWRITE is executed. If other ZooKeeper services are needed for better performance, you need to configure the following Hive parameters:
- Change the value of hive.zookeeper.quorum.for.overwrite to the host name of the ZooKeeper you want to use.
- Change the value of hive.zookeeper.client.port.for.overwrite to the ZooKeeper port.
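For example (the host names and port below are placeholders; use the actual quorum and client port of the ZooKeeper service you want to use):
hive.zookeeper.quorum.for.overwrite = zk-host1,zk-host2,zk-host3
hive.zookeeper.client.port.for.overwrite = 2181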
Hive Log Overview
Log Description
Log path: The default save path of Hive logs is /var/log/Bigdata/hive/role name, the default save path of Hive1 logs is /var/log/Bigdata/hive1/role name, and the others follow the same rule.
- HiveServer: /var/log/Bigdata/hive/hiveserver (run log) and /var/log/Bigdata/audit/hive/hiveserver (audit log)
- MetaStore: /var/log/Bigdata/hive/metastore (run log) and /var/log/Bigdata/audit/hive/metastore (audit log)
- WebHCat: /var/log/Bigdata/hive/webhcat (run log) and /var/log/Bigdata/audit/hive/webhcat (audit log)
Log archive rule: The automatic compression and archiving function of Hive is enabled. By default, when the size of a log file exceeds 40 MB (this threshold is adjustable), the log file is automatically compressed. The naming rule of a compressed log file is as follows: <Original log name>-<yyyy-mm-dd_hh-mm-ss>.[ID].log.zip. A maximum of 20 latest compressed files are retained. The number of compressed files and the compression threshold can be configured.
Log Type | Log File Name | Description
---|---|---
Run log | /hiveserver/hiveserver.out | Log file that records HiveServer running environment information.
 | /hiveserver/hive.log | Run log file of the HiveServer process.
 | /hiveserver/hive-omm-<Date>-<PID>-gc.log.<No.> | GC log file of the HiveServer process.
 | /hiveserver/prestartDetail.log | Work log file before the HiveServer startup.
 | /hiveserver/check-serviceDetail.log | Log file that records whether the Hive service starts successfully.
 | /hiveserver/cleanupDetail.log | Cleanup log file about the HiveServer uninstallation.
 | /hiveserver/startDetail.log | Startup log file of the HiveServer process.
 | /hiveserver/stopDetail.log | Shutdown log file of the HiveServer process.
 | /hiveserver/localtasklog/omm_<Date>_<Task ID>.log | Run log file of the local Hive task.
 | /hiveserver/localtasklog/omm_<Date>_<Task ID>-gc.log.<No.> | GC log file of the local Hive task.
 | /metastore/metastore.log | Run log file of the MetaStore process.
 | /metastore/hive-omm-<Date>-<PID>-gc.log.<No.> | GC log file of the MetaStore process.
 | /metastore/postinstallDetail.log | Work log file after the MetaStore installation.
 | /metastore/prestartDetail.log | Work log file before the MetaStore startup.
 | /metastore/cleanupDetail.log | Cleanup log file of the MetaStore uninstallation.
 | /metastore/startDetail.log | Startup log file of the MetaStore process.
 | /metastore/stopDetail.log | Shutdown log file of the MetaStore process.
 | /metastore/metastore.out | Log file that records MetaStore running environment information.
 | /webhcat/webhcat-console.out | Log file that records the normal start and stop of the WebHCat process.
 | /webhcat/webhcat-console-error.out | Log file that records the start and stop exceptions of the WebHCat process.
 | /webhcat/prestartDetail.log | Work log file before the WebHCat startup.
 | /webhcat/cleanupDetail.log | Cleanup logs generated during WebHCat uninstallation or before WebHCat installation.
 | /webhcat/hive-omm-<Date>-<PID>-gc.log.<No.> | GC log file of the WebHCat process.
 | /webhcat/webhcat.log | Run log file of the WebHCat process.
Audit log | hive-audit.log, hive-rangeraudit.log | HiveServer audit log files.
 | queryinfo.log | HiveServer query log, which records SQL running statistics and SQL interception information.
 | metastore-audit.log | MetaStore audit log file.
 | webhcat-audit.log | WebHCat audit log file.
 | jetty-<Date>.request.log | Request logs of the Jetty service.
Log Levels
Table 2-817 describes the log levels supported by Hive.
Levels of run logs are ERROR, WARN, INFO, and DEBUG from the highest to the lowest priority. Run logs of equal or higher levels are recorded. The higher the specified log level, the fewer the logs recorded.
Table 2-817 Log levels

Level | Description
---|---
ERROR | Logs of this level record error information about system running.
WARN | Logs of this level record exception information about the current event processing.
INFO | Logs of this level record normal running status information about the system and events.
DEBUG | Logs of this level record system information and system debugging information.
To modify log levels, perform the following operations:
- Go to the All Configurations page of the Hive service by referring to Modifying Cluster Service Configuration Parameters.
- On the menu bar on the left, select the log menu of the target role.
- Select a desired log level and save the configuration.
The Hive log level takes effect immediately after being configured. You do not need to restart the service.
Log Formats
The Hive log formats are as follows:

Run log
- Format: <yyyy-MM-dd HH:mm:ss,SSS>|<LogLevel>|<Thread that generates the log>|<Message in the log>|<Location of the log event>
- Example: 2014-11-05 09:45:01,242 | INFO | main | Starting hive metastore on port 21088 | org.apache.hadoop.hive.metastore.HiveMetaStore.main(HiveMetaStore.java:5198)

Audit log
- Format: <yyyy-MM-dd HH:mm:ss,SSS>|<LogLevel>|<Thread that generates the log>|<UserName><UserIP><Time><Operation><Resource><Result><Detail>|<Location of the log event>
- Example: 2018-12-24 12:16:25,319 | INFO | HiveServer2-Handler-Pool: Thread-185 | UserName=hive UserIP=10.153.2.204 Time=2018/12/24 12:16:25 Operation=CloseSession Result=SUCCESS Detail= | org.apache.hive.service.cli.thrift.ThriftCLIService.logAuditEvent(ThriftCLIService.java:434)
Hive Performance Tuning
Creating Table Partitions
Scenario
During the Select query, Hive generally scans the entire table, which is time-consuming. To improve query efficiency, create table partitions based on service requirements and query dimensions.
Procedure
- Log in to the node where the Hive client has been installed as user root.
- Run the following command to go to the client installation directory, for example, /opt/hadoopclient:
cd /opt/hadoopclient
- Run the source bigdata_env command to configure environment variables for the client.
- Run the following command on the client for login:
kinit Username
- Run the following command to log in to the client tool:
beeline
- Select the static or dynamic partition. (A combined example sketch is provided after this list.)
- Static partition:
Manually enter a partition name, and use the keyword PARTITIONED BY to specify partition column name and data type when creating a table. During application development, use the ALTER TABLE ADD PARTITION statement to add a partition and use the LOAD DATA INTO PARTITION statement to load data to the partition, which supports only static partitions.
- Dynamic partition: Use a query command to insert results to a partition of a table. The partition can be a dynamic partition.
The dynamic partition can be enabled on the client tool by running the following command:
set hive.exec.dynamic.partition=true;
The default mode of the dynamic partition is strict. That is, at least one column must be specified as a static partition, under which dynamic sub-partitions can be created. You can run the following command to enable the completely dynamic partition mode:
set hive.exec.dynamic.partition.mode=nonstrict;
- The dynamic partition may cause a DML statement to create a large number of partitions and new mapping folders, which deteriorates system performance.
- If there are a large number of files, it takes a long time to run a SQL statement. You can run the set mapreduce.input.fileinputformat.list-status.num-threads = 100; statement before running a SQL statement to shorten the time. The parameter mapreduce.input.fileinputformat.list-status.num-threads can be set only after being added to the Hive whitelist.
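The following is a minimal HiveQL sketch that combines both approaches (the table, column, and path names are assumptions for illustration only):
-- Static partition: declare the partition column at creation time, then add the partition and load data explicitly
create table sales_info(id int, amount double) partitioned by (dt string);
alter table sales_info add partition (dt='2024-01-01');
load data inpath '/tmp/sales_20240101' into table sales_info partition (dt='2024-01-01');
-- Dynamic partition: enable dynamic partitioning and let the partition value come from the query result
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert into table sales_info partition (dt) select id, amount, dt from sales_staging;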
Optimizing Join
Scenario
When the Join statement is used, the command execution speed and query speed may be slow in case of large data volume. To resolve this problem, you can optimize Join.
Join optimization can be classified into the following modes:
- Map Join
- Sort Merge Bucket Map Join
- Optimizing Join Sequences
Map Join
Hive Map Join applies to small tables (the table size is less than 25 MB) that can be stored in the memory. The table size can be defined using hive.mapjoin.smalltable.filesize, and the default table size is 25 MB.
Map Join has two methods:
- Use /*+ MAPJOIN(join_table) */.
- Set the following parameter before running the statement. The default value is true in the current version.
set hive.auto.convert.join=true;
When Map Join is used, there is no Reduce task. Instead, a MapReduce local task is created before the Map job. This task uses TableScan to read the small table data to the local node, saves it to the local disk as a HashTable, uploads it to HDFS, and places it in the distributed cache. The Map task reads the small table data from the local disk or the distributed cache and joins it with the large table to produce the output.
When using Map Join, ensure that the small table is not too large. If the small table uses up the memory, system performance deteriorates and memory leakage may even occur.
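A hedged example of the hint-based method (the table names are assumptions; the table listed inside the hint is the small one loaded into memory):
select /*+ MAPJOIN(d) */ f.id, d.dim_name from fact_table f join dim_table d on f.id = d.id;
Note that on some Hive versions the MAPJOIN hint takes effect only when hive.ignore.mapjoin.hint is set to false; relying on automatic conversion through hive.auto.convert.join is usually simpler.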
Sort Merge Bucket Map Join
The following conditions must be met before using Sort Merge Bucket Map Join:
- The two Join tables are large and cannot be stored in the memory.
- The two tables are bucketed (CLUSTERED BY (column)) and sorted (SORTED BY (column)) on the join key, and the bucket count of one table is an integral multiple of that of the other.
Set the following parameters to enable Sort Merge Bucket Map Join:
set hive.optimize.bucketmapjoin=true;
set hive.optimize.bucketmapjoin.sortedmerge=true;
This type of Map Join does not have Reduce tasks either. A MapReduce local task is started before the Map job to read the small table data bucket by bucket to the local node. The local node saves the HashTable backups of the buckets, writes them to HDFS, and also saves them in the distributed cache. The Map task reads the small table data of the matching bucket from the local disk or the distributed cache, joins it with the large table, and produces the output.
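A hedged sketch of two tables that satisfy these conditions (the table names, columns, and bucket counts are assumptions; 32 is an integral multiple of 16, and both tables are bucketed and sorted on the join key user_id):
create table orders_bucketed(user_id int, order_info string)
clustered by (user_id) sorted by (user_id) into 32 buckets stored as orc;
create table users_bucketed(user_id int, user_name string)
clustered by (user_id) sorted by (user_id) into 16 buckets stored as orc;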
Optimizing Join Sequences
If the Join operation is to be performed on three or more tables and different Join sequences are used, the execution time will be greatly different. Using an appropriate Join sequence can shorten the time for task execution.
Rules of a Join sequence:
- A table with small data volume or a combination with fewer results generated after a Join operation is executed first.
- A table with large data volume or a combination with more results generated after a Join operation is executed later.
For example, the customer table has the largest data volume, and fewer results will be generated if a Join operation is performed on the orders and lineitem tables first.
The original Join statement is as follows.
select
  l_orderkey,
  sum(l_extendedprice * (1 - l_discount)) as revenue,
  o_orderdate,
  o_shippriority
from customer, orders, lineitem
where c_mktsegment = 'BUILDING'
  and c_custkey = o_custkey
  and l_orderkey = o_orderkey
  and o_orderdate < '1995-03-22'
  and l_shipdate > '1995-03-22'
group by l_orderkey, o_orderdate, o_shippriority
limit 10;
After the sequence is optimized, the Join statements are as follows:
select
  l_orderkey,
  sum(l_extendedprice * (1 - l_discount)) as revenue,
  o_orderdate,
  o_shippriority
from orders, lineitem, customer
where c_mktsegment = 'BUILDING'
  and c_custkey = o_custkey
  and l_orderkey = o_orderkey
  and o_orderdate < '1995-03-22'
  and l_shipdate > '1995-03-22'
group by l_orderkey, o_orderdate, o_shippriority
limit 10;
Precautions
Join Data Skew Problem
Data skew refers to the symptom that the task progress stays at 99% for a long time.
Data skew often exists because the data volume of a few Reduce tasks is much larger than that of others. Most Reduce tasks are complete while a few Reduce tasks are not complete.
To resolve the data skew problem, set hive.optimize.skewjoin to true and adjust the value of hive.skewjoin.key. hive.skewjoin.key specifies the maximum number of keys received by a Reduce task. If the number reaches the maximum, the keys are automatically distributed to other Reduce tasks.
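For example, run the following on the Hive client (the threshold value is illustrative; tune it based on the actual key distribution):
set hive.optimize.skewjoin=true;
set hive.skewjoin.key=100000;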
Optimizing Group By
Scenario
Optimize the Group by statement to accelerate the command execution and query speed.
During the Group by operation, Map performs grouping and distributes the groups to Reduce; Reduce then performs grouping again. Group by optimization can be performed by enabling Map aggregation to reduce Map output data volume.
Procedure
On a Hive client, set the following parameter:
set hive.map.aggr=true;
Precautions
Group By Data Skew
The Group By operation may also have data skew problems. When hive.groupby.skewindata is set to true, the generated query plan has two MapReduce jobs. In the first job, the Map output is randomly distributed to the Reduce tasks, and each Reduce task performs a partial aggregation and produces an output. Because the same Group By key may be distributed to different Reduce tasks, this step balances the load. Based on the preprocessed results, the second job distributes each Group By key to a single Reduce task to complete the final aggregation.
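For example, to enable Map-side aggregation together with the two-job skew handling described above, run the following on the Hive client:
set hive.map.aggr=true;
set hive.groupby.skewindata=true;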
Count Distinct Aggregation Problem
When the aggregation function count(distinct) is used for deduplication counting, serious Reduce-side data skew occurs if the processed data contains many empty values. Process empty values independently: if only count(distinct) is used, exclude the empty values using a WHERE clause and add 1 to the final count(distinct) result. If there are other computing operations, process the empty values independently and then combine the result with the other computing results.
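A hedged sketch of this approach (the table and column names are assumptions): exclude empty values in the WHERE clause and add 1 back to account for the empty value, assuming empty values actually exist in the data:
select count(distinct user_id) + 1 as uv from access_log where user_id is not null and user_id != '';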
Optimizing Data Storage
Scenario
ORC is an efficient column storage format and has higher compression ratio and reading efficiency than other file formats.
You are advised to use ORC as the default Hive table storage format.
Prerequisites
You have logged in to the Hive client. For details, see Using a Hive Client.
Procedure
- Recommended: SNAPPY compression, which applies to scenarios with balanced requirements on compression ratio and read efficiency.
Create table xx (col_name data_type) stored as orc tblproperties ("orc.compress"="SNAPPY");
- Available: ZLIB compression, which applies to scenarios with high compression ratio requirements.
Create table xx (col_name data_type) stored as orc tblproperties ("orc.compress"="ZLIB");
xx indicates the specific Hive table name.
Optimizing SQL Statements
Scenario
When SQL statements are executed on Hive, if the (a&b) or (a&c) logic exists in the statements, you are advised to change the logic to a & (b or c).
Example
If condition a is p_partkey = l_partkey, the statements before optimization are as follows:
select sum(l_extendedprice * (1 - l_discount)) as revenue
from lineitem, part
where
  ( p_partkey = l_partkey
    and p_brand = 'Brand#32'
    and p_container in ('SM CASE', 'SM BOX', 'SM PACK', 'SM PKG')
    and l_quantity >= 7 and l_quantity <= 7 + 10
    and p_size between 1 and 5
    and l_shipmode in ('AIR', 'AIR REG')
    and l_shipinstruct = 'DELIVER IN PERSON' )
  or
  ( p_partkey = l_partkey
    and p_brand = 'Brand#35'
    and p_container in ('MED BAG', 'MED BOX', 'MED PKG', 'MED PACK')
    and l_quantity >= 15 and l_quantity <= 15 + 10
    and p_size between 1 and 10
    and l_shipmode in ('AIR', 'AIR REG')
    and l_shipinstruct = 'DELIVER IN PERSON' )
  or
  ( p_partkey = l_partkey
    and p_brand = 'Brand#24'
    and p_container in ('LG CASE', 'LG BOX', 'LG PACK', 'LG PKG')
    and l_quantity >= 26 and l_quantity <= 26 + 10
    and p_size between 1 and 15
    and l_shipmode in ('AIR', 'AIR REG')
    and l_shipinstruct = 'DELIVER IN PERSON' )
The statements after optimization are as follows:
select sum(l_extendedprice * (1 - l_discount)) as revenue
from lineitem, part
where p_partkey = l_partkey
  and (
    ( p_brand = 'Brand#32'
      and p_container in ('SM CASE', 'SM BOX', 'SM PACK', 'SM PKG')
      and l_quantity >= 7 and l_quantity <= 7 + 10
      and p_size between 1 and 5
      and l_shipmode in ('AIR', 'AIR REG')
      and l_shipinstruct = 'DELIVER IN PERSON' )
    or
    ( p_brand = 'Brand#35'
      and p_container in ('MED BAG', 'MED BOX', 'MED PKG', 'MED PACK')
      and l_quantity >= 15 and l_quantity <= 15 + 10
      and p_size between 1 and 10
      and l_shipmode in ('AIR', 'AIR REG')
      and l_shipinstruct = 'DELIVER IN PERSON' )
    or
    ( p_brand = 'Brand#24'
      and p_container in ('LG CASE', 'LG BOX', 'LG PACK', 'LG PKG')
      and l_quantity >= 26 and l_quantity <= 26 + 10
      and p_size between 1 and 15
      and l_shipmode in ('AIR', 'AIR REG')
      and l_shipinstruct = 'DELIVER IN PERSON' )
  )
Optimizing the Query Function Using Hive CBO
Scenario
When joining multiple tables in Hive, Hive supports Cost-Based Optimization (CBO). The system automatically selects the optimal plan based on the table statistics, such as the data volume and number of files, to improve the efficiency of joining multiple tables. Hive needs to collect table statistics before CBO optimization.
- The CBO optimizes the join sequence based on statistics and search criteria. However, the join sequence may fail to be optimized in some special scenarios, for example, when data skew occurs or when the query condition values are not in the table.
- When column statistics collection is enabled, Reduce operations must be performed for aggregation. For insert tasks without the Reduce phase, Reduce operations will be performed to collect statistics.
Prerequisites
You have logged in to the Hive client. For details, see Using a Hive Client.
Procedure
- On the Manager UI, search for the hive.cbo.enable parameter in the service configuration of the Hive component, and select true to enable the function permanently.
- Collect statistics about the existing data in Hive tables manually.
Run the following command to manually collect statistics. Statistics can be collected for only one table at a time; if statistics need to be collected for multiple tables, run the command repeatedly.
ANALYZE TABLE [db_name.]tablename [PARTITION(partcol1[=val1], partcol2[=val2], ...)]
COMPUTE STATISTICS
[FOR COLUMNS]
[NOSCAN];
- When FOR COLUMNS is specified, column-level statistics are collected.
- When NOSCAN is specified, statistics about the file size and number of files will be collected, but specific files will not be scanned.
For example:
analyze table table_name compute statistics;
analyze table table_name compute statistics for columns;
- Configure the automatic statistics collection function of Hive. After the function is enabled, new statistics will be collected only when you insert data by running the insert overwrite/into command.
- Run the following commands on the Hive client to enable the statistics collection function temporarily:
set hive.stats.autogather = true; enables the automatic collection of table/partition-level statistics.
set hive.stats.column.autogather = true; enables the automatic collection of column-level statistics.
- The column-level statistics collection does not support complex data types, such as Map and Struct.
- The automatic table-level statistics collection does not support Hive on HBase tables.
- On the Manager UI, search for the hive.stats.autogather and hive.stats.column.autogather parameters in the service configuration of Hive, and select true to enable the collection function permanently.
- Run the following commands on the Hive client to enable the statistics collection function temporarily:
- Run the following command to view statistics:
DESCRIBE FORMATTED table_name[.column_name] PARTITION partition_spec;
For example:
desc formatted table_name;
desc formatted table_name id;
desc formatted table_name partition(time='2016-05-27');
Partition tables only support partition-level statistics collection, so you must specify partitions to query statistics for partition tables.
Common Issues About Hive
How Do I Delete UDFs on Multiple HiveServers at the Same Time?
Question
How can I delete permanent user-defined functions (UDFs) on multiple HiveServers at the same time?
Answer
Multiple HiveServers share one MetaStore database. Therefore, there is a delay in the data synchronization between the MetaStore database and the HiveServer memory. If a permanent UDF is deleted from one HiveServer, the operation result cannot be synchronized to the other HiveServers promptly.
In this case, you need to log in to the Hive client to connect to each HiveServer and delete permanent UDFs on the HiveServers one by one. The operations are as follows:
- Log in to the node where the Hive client is installed as the Hive client installation user.
- Run the following command to go to the client installation directory:
cd Client installation directory
For example, if the client installation directory is /opt/hadoopclient, run the following command:
cd /opt/hadoopclient
- Run the following command to configure environment variables:
source bigdata_env
- Run the following command to authenticate the user:
kinit Hive service user
The login user must have the Hive admin rights.
- Run the following command to connect to the specified HiveServer:
beeline -u "jdbc:hive2://10.39.151.74:21066/default;sasl.qop=auth-conf;auth=KERBEROS;principal=hive/hadoop.<system domain name>@<system domain name>"
- 10.39.151.74 is the IP address of the node where the HiveServer is located.
- 21066 is the port number of the HiveServer. The HiveServer port number ranges from 21066 to 21070 by default. Use the actual port number.
- hive is the username. For example, if the Hive1 instance is used, the username is hive1.
- You can log in to FusionInsight Manager, choose System > Permission > Domain and Mutual Trust, and view the value of Local Domain, which is the current system domain name.
- hive/hadoop.<system domain name> is the username. All letters in the system domain name contained in the username are lowercase letters.
- Run the following command to enable the Hive admin rights:
set role admin;
- Run the following command to delete the permanent UDF:
drop function function_name;
- function_name indicates the name of the permanent function.
- If the permanent UDF is created in Spark, the permanent UDF needs to be deleted from Spark and then from HiveServer by running the preceding command.
- Check whether the permanent UDFs are deleted from all HiveServers.
- If yes, no further action is required.
- If no, go to Step 5.
Why Cannot the DROP operation Be Performed on a Backed-up Hive Table?
Question
Why cannot the DROP operation be performed for a backed up Hive table?
Answer
Snapshots have been created for an HDFS directory mapping to the backed up Hive table, so the HDFS directory cannot be deleted. As a result, the Hive table cannot be deleted.
When a Hive table is being backed up, snapshots are created for the HDFS directory mapping to the table. The snapshot mechanism of HDFS has the following limitation: If snapshots have been created for an HDFS directory, the directory cannot be deleted or renamed unless the snapshots are deleted. When the DROP operation is performed for a Hive table (except the EXTERNAL table), the system attempts to delete the HDFS directory mapping to the table. If the directory fails to be deleted, the system displays a message indicating that the table fails to be deleted.
If you need to delete this table, manually delete all backup tasks related to this table. For details, see Viewing Backup and Restoration Tasks.
How to Perform Operations on Local Files with Hive User-Defined Functions
Question
How to perform operations on local files (such as reading the content of a file) with Hive user-defined functions?
Answer
By default, you can perform operations on local files with their relative paths in UDF. The following are sample codes:
public String evaluate(String text) {
    // some logic
    File file = new File("foo.txt");
    // some logic
    // do return here
}
In Hive, upload the file foo.txt used in UDF to HDFS, such as hdfs://hacluster/tmp/foo.txt. You can perform operations on the foo.txt file by creating UDF with the following sentences:
create function testFunc as 'some.class' using jar 'hdfs://hacluster/somejar.jar', file 'hdfs://hacluster/tmp/foo.txt';
In abnormal cases, if the value of hive.fetch.task.conversion is more, you can perform operations on local files in the UDF by using an absolute path instead of a relative path. In addition, you must ensure that the file exists on all HiveServer and NodeManager nodes and that the omm user has the corresponding operation permissions.
How Do I Forcibly Stop MapReduce Jobs Executed by Hive?
Question
How do I stop a MapReduce task manually if the task is suspended for a long time?
Answer
- Log in to FusionInsight Manager.
- Choose Cluster > Name of the desired cluster > Services > Yarn.
- On the left pane, click ResourceManager(Host name, Active), and log in to Yarn.
- Click the button corresponding to the task ID. On the task page that is displayed, click Kill Application in the upper left corner and click OK in the displayed dialog box to stop the task.
Table Creation Fails Because Hive Complex Fields' Names Contain Special Characters
Question
Table creation fails because Hive complex fields' names contain special characters.
Answer
Hive does not support complex fields' names that contain special characters. Special characters refer to characters other than uppercase and lowercase letters, digits, Chinese characters, and Portuguese characters.
How Do I Monitor the Hive Table Size?
Question
How do I monitor the Hive table size?
Answer
The HDFS refined monitoring function allows you to monitor the size of a specified table directory.
Prerequisites
- The Hive and HDFS components are running properly.
- The HDFS refined monitoring function is normal.
Procedure
- Log in to FusionInsight Manager.
- Choose Cluster > Services > HDFS and click Resource.
- Click the first icon in the upper left corner of Resource Usage (by Directory).
- In the displayed sub page for configuring space monitoring, click Add.
- In the displayed Add a Monitoring Directory dialog box, set Name to the name or the user-defined alias of the table to be monitored and Path to the path of the monitored table. Click OK. In the monitoring result, the horizontal coordinate indicates the time, and the vertical coordinate indicates the size of the monitored directory.
How Do I Prevent Key Directories from Data Loss Caused by Misoperations of the insert overwrite Statement?
Question
How do I prevent key directories from data loss caused by misoperations of the insert overwrite statement?
Answer
During monitoring of key Hive databases, tables, or directories, to prevent data loss caused by misoperations of the insert overwrite statement, configure hive.local.dir.confblacklist in Hive to protect directories.
This configuration item has been configured for directories such as /opt/ and /user/hive/warehouse by default.
Prerequisites
The Hive and HDFS components are running properly.
Procedure
- Log in to FusionInsight Manager.
- Choose Cluster > Services > Hive. Click Configurations then All Configurations, and search for the hive.local.dir.confblacklist parameter.
- Add paths of databases, tables, or directories to be protected in the parameter value.
- Click Save to save the settings.
Why Is Hive on Spark Task Freezing When HBase Is Not Installed?
Scenario
This function applies to Hive.
Perform the following operations to configure parameters so that Hive on Spark tasks do not freeze when they are executed in an environment where HBase is not installed.
The Spark kernel version used by Hive on Spark tasks has been upgraded, so Hive on Spark tasks can be executed without installing the Spark service. However, if HBase is not installed, the system by default attempts to connect to ZooKeeper to access HBase when Spark tasks are executed, until a timeout occurs. As a result, the tasks freeze.
If HBase is not installed, perform the following operations to execute Hive on Spark tasks. If HBase is upgraded from an earlier version, you do not need to configure parameters after the upgrade.
Procedure
- Log in to FusionInsight Manager.
- Choose Cluster > Services > Hive. Click Configurations then All Configurations.
- Choose HiveServer(Role) > Customization. Add a customized parameter to the spark-defaults.conf parameter file. Set Name to spark.security.credentials.hbase.enabled, and set Value to false.
- Click Save. In the dialog box that is displayed, click OK.
- Choose Cluster > Services > Hive. Click Instance, select all Hive instances, click More, and select Restart Instance. In the displayed dialog box, enter the password and click OK.
Error Reported When the WHERE Condition Is Used to Query Tables with Excessive Partitions in FusionInsight Hive
Question
When the WHERE condition is used to query a table with an excessive number of partitions, an error similar to the following is reported:
Caused by: java.io.IOException: Tried to send an out-of-range integer as a 2-byte value: 32970
at org.postgresql.core.PGStream.SendInteger2(PGStream.java:199)
at org.postgresql.core.v3.QueryExecutorImpl.sendParse(QueryExecutorImpl.java:1330)
at org.postgresql.core.v3.QueryExecutorImpl.sendOneQuery(QueryExecutorImpl.java:1601)
at org.postgresql.core.v3.QueryExecutorImpl.sendParse(QueryExecutorImpl.java:1191)
at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:346)
Answer
During a query with partition conditions, HiveServer optimizes the partitions to avoid full table scanning. All partitions whose metadata meets the conditions need to be queried. However, the sendOneQuery interface provided by GaussDB limits the parameter value to 32767 in the sendParse method. If the number of partition conditions exceeds 32767, an exception occurs. If you have to query a large number of partitions in a single SQL statement, see Procedure.
Procedure
- Log in to FusionInsight Manager, choose Clusters > Services > Hive. On the displayed page, click the Configurations tab and select All Configurations. Choose MetaStore(Role) > Customization, and add metastore.direct.sql.batch.size and its value 10000 to the parameter file hivemetastore-site.xml.
- Click Save. In the dialog box that is displayed, click OK.
- Click the Instance tab, select all MetaStore instances, click More > Restart Instance, enter the administrator password, and click OK to restart the MetaStore instances.
Why Cannot I Connect to HiveServer When I Use IBM JDK to Access the Beeline Client?
Scenario
If the JDK version used by the client is IBM JDK, the Beeline client needs to be reconfigured as follows. Otherwise, the client fails to connect to HiveServer.
Procedure
- Log in to FusionInsight Manager and choose System > Permission > User. In the Operation column of the target user, choose More > Download Authentication Credential, select the cluster information, and click OK to download the keytab file.
- Decompress the downloaded authentication credential package and use WinSCP to upload the decompressed user.keytab file to the Hive client installation directory on the node to be operated, for example, /opt/hadoopclient.
- Run the following command to open the Hive/component_env configuration file in the Hive client directory:
vi Hive client installation directory/Hive/component_env
Append the following content to the end of the line where export CLIENT_HIVE_URI is located:
\;user.principal=Username@HADOOP.COM\;user.keytab=user.keytab file path/user.keytab
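For example, for a hypothetical user hiveuser whose user.keytab file is stored in /opt/hadoopclient, the appended content is as follows (the user name and path are illustrative only):
\;user.principal=hiveuser@HADOOP.COM\;user.keytab=/opt/hadoopclient/user.keytab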
Why Cannot Data Be Queried After the MapReduce Engine Is Switched After the Tez Engine Is Used to Execute Union-related Statements?
Question
Hive uses the Tez engine to execute union-related statements to write data. After Hive is switched to the MapReduce engine for query, no data is found.
Answer
When Hive uses the Tez engine to execute a union-related statement, the generated output files are stored in the HIVE_UNION_SUBDIR subdirectory. After Hive is switched back to the MapReduce engine, files in subdirectories are not read by default. Therefore, the data in the HIVE_UNION_SUBDIR directory cannot be queried.
In this case, set mapreduce.input.fileinputformat.input.dir.recursive to true so that the MapReduce engine recursively reads files in subdirectories, including HIVE_UNION_SUBDIR.
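For example, run the following statement on the Hive Beeline CLI before executing the query so that the setting takes effect for the current session:
set mapreduce.input.fileinputformat.input.dir.recursive=true;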
Why Does Hive Not Support Vectorized Query?
Question
When the vectorization parameter hive.vectorized.execution.enabled is set to true, why do null pointer or type conversion exceptions occasionally occur when Hive on Tez/MapReduce/Spark tasks are executed?
Answer
Currently, Hive does not support vectorized execution. Vectorized execution has many known community issues that have not been stably resolved. The default value of hive.vectorized.execution.enabled is false, and you are advised not to set this parameter to true.
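If the parameter has been set to true in the current session, run the following statement to restore the default value:
set hive.vectorized.execution.enabled=false;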
How Do I Disable the Logging Function of Hive?
Question
How do I disable the logging function of Hive?
Answer
- Log in to the node where the client is installed as user root.
- Run the following command to switch to the client installation directory, for example, /opt/hadoopclient:
cd /opt/hadoopclient
- Run the following command to configure environment variables:
source bigdata_env
- Log in to the Hive client based on the cluster authentication mode.
- In security mode, run the following command to complete user authentication and log in to the Hive client:
kinit Component service user
beeline
- In normal mode, run one of the following commands to log in to the Hive client:
- Run the following command to log in to the Hive client as the component service user:
beeline -n component service user
- If no component service user is specified, the current OS user is used to log in to the Hive client.
beeline
- Run the following command to disable the logging function:
set hive.server2.logging.operation.enabled=false;
- Run the following command to check whether the logging function is disabled. If the command output contains hive.server2.logging.operation.enabled=false, the logging function is disabled successfully.
set hive.server2.logging.operation.enabled;
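The Beeline output is similar to the following (column widths may vary):
+--------------------------------------------------+
|                       set                        |
+--------------------------------------------------+
| hive.server2.logging.operation.enabled=false     |
+--------------------------------------------------+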
Why Do Metastore Instances Failed to Be Restarted After They Are Deleted and Added Again?
Question
Why do MetaStore instances fail to be restarted after they are deleted and added again?
Answer
Possible causes of the MetaStore instance restart failure are as follows:
- After MetaStore instances are added again, the configurations of HiveServer and WebHCat are not updated, and the old password is still used for connection. After multiple failed attempts, user hive of the DBService database is locked.
- The Hive metadata database HIVEMETA can be accessed only through MetaStore. Before MetaStore instances are deleted, they are stopped and Hive-related connections are disconnected. When all MetaStore instances are deleted, the HIVEMETA database is deleted as well. However, if other users or components are still connected to HIVEMETA, the database cannot be deleted. As a result, MetaStore instances fail to be added again.
In this case, perform the following steps:
- Log in to FusionInsight Manager, choose Cluster > Services > DBService, click the Instance tab, and view the IP address of the active DBServer node.
- Log in to the active DBServer node as user root and run the following commands, in which XXX indicates the password of user omm of the DBService database. Contact the administrator to obtain the password.
su - omm
source ${BIGDATA_HOME}/FusionInsight_Current/*DBServer/install/.dbservice_profile
gsql -d POSTGRES -U omm -p 20051 -W XXX
- Run the following command to unlock user hive:
alter role hive account unlock;
- Log in to FusionInsight Manager and choose Cluster > Services > Hive. On the page that is displayed, choose More > Restart Service to restart the Hive service.
- If the restart is successful, no further action is required.
- If the restart fails, go to Step 5.
- Run the following command to query the connection to the HIVEMETA database:
select DATID,DATNAME,PID,USENAME,CLIENT_ADDR,CLIENT_HOSTNAME,CLIENT_PORT,STATE from pg_stat_activity where DATNAME='HIVEMETA';
- Stop the clients that are connected to HIVEMETA based on the query result. If a connection is no longer needed, run the kill -9 PID command on the active DBServer node to forcibly terminate it. PID indicates the value in the PID column queried in Step 5, for example, 17865.
- Log in to FusionInsight Manager, choose Cluster > Services > Hive, and click the Instance tab. On the displayed page, select all Metastore instances and choose More > Delete Instance.
- Click Add Instance to add the MetaStore instances again.
- After instances are added, select all HiveServer and WebHCat instances and choose More > Restart Instance.
Hive Configuration Problems
- The error message "java.lang.OutOfMemoryError: Java heap space." is displayed during Hive SQL execution.
Solution:
- For MapReduce tasks, increase the values of the following parameters:
set mapreduce.map.memory.mb=8192;
set mapreduce.map.java.opts=-Xmx6554M;
set mapreduce.reduce.memory.mb=8192;
set mapreduce.reduce.java.opts=-Xmx6554M;
- For Tez tasks, increase the value of the following parameter:
set hive.tez.container.size=8192;
- After a column name is changed to a new one using the Hive SQL as statement, the error message "Invalid table alias or column reference 'xxx'." is displayed when the original column name is used for compilation.
Solution: Run the set hive.cbo.enable=true; statement.
- The error message "Unsupported SubQuery Expression 'xxx': Only SubQuery expressions that are top level conjuncts are allowed." is displayed during Hive SQL subquery compilation.
Solution: Run the set hive.cbo.enable=true; statement.
- The error message "CalciteSubquerySemanticException [Error 10249]: Unsupported SubQuery Expression Currently SubQuery expressions are only allowed as Where and Having Clause predicates." is displayed during Hive SQL subquery compilation.
Solution: Run the set hive.cbo.enable=true; statement.
- The error message "Error running query: java.lang.AssertionError: Cannot add expression of different type to set." is displayed during Hive SQL compilation.
Solution: Run the set hive.cbo.enable=false; statement.
- The error message "java.lang.NullPointerException at org.apache.hadoop.hive.ql.udf.generic.GenericUDAFComputeStats$GenericUDAFNumericStatsEvaluator.init." is displayed during Hive SQL execution.
Solution: Run the set hive.map.aggr=false; statement.
- When hive.auto.convert.join is set to true (enabled by default) and hive.optimize.skewjoin is set to true, the error message "ClassCastException org.apache.hadoop.hive.ql.plan.ConditionalWork cannot be cast to org.apache.hadoop.hive.ql.plan.MapredWork" is displayed.
Solution: Run the set hive.optimize.skewjoin=false; statement.
- When hive.auto.convert.join is set to true (enabled by default), hive.optimize.skewjoin is set to true, and hive.exec.parallel is set to true, the error message "java.io.FileNotFoundException: File does not exist:xxx/reduce.xml" is displayed.
Solution:
- Method 1: Switch the execution engine to Tez. For details, see Switching the Hive Execution Engine to Tez.
- Method 2: Run the set hive.exec.parallel=false; statement.
- Method 3: Run the set hive.auto.convert.join=false; statement.
- The error message "NullPointerException at org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.mergeJoinComputeKeys" is displayed when Hive on Tez executes a bucket map join.
Solution: Run the set tez.am.container.reuse.enabled=false; statement.
How Do I Handle the Error Reported When Setting hive.exec.stagingdir on the Hive Client?
Scenario
By default, the Hive client does not allow you to modify the temporary directory created when data is inserted. If you run set hive.exec.stagingdir=xxx on the client to modify the temporary directory, an error is reported.
Answer
hive.exec.stagingdir specifies the temporary directory created when data is inserted. If the staging directory is modified while data is concurrently inserted into the same table, data may be lost. Therefore, modifying this parameter on the client is not allowed by default. If no data will be inserted concurrently, you can enable modification of this parameter on the client. For details, see Procedure.
Procedure
- Log in to FusionInsight Manager, choose Clusters > Services > Hive. On the displayed page, click the Configurations tab and select All Configurations. Choose HiveServer(Role) > Customization, and add hive.conf.internal.variable.list and its value hive.added.files.path,hive.added.jars.path,hive.added.archives.path to the parameter file hive-site.xml.
- Click Save. In the dialog box that is displayed, click OK.
- Click the Instance tab, select all HiveServer instances, click More > Restart Instance, enter the administrator password, and click OK to restart the HiveServer instances.
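After the HiveServer instances are restarted, you can set the temporary directory on the client, for example (the path below is illustrative only):
set hive.exec.stagingdir=/tmp/hive_staging;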
Load Data Command Failed to Import a Specified Wildcard Matching File to a Dynamic Partitioned Table
Scenario
In the command for importing data to a Hive table, if a file is specified using wildcard characters and the target table is a partitioned table, the original file will not be overwritten. For example, the SQL statement below cannot overwrite the original files in the partitioned table:
load data inpath 'hdfs://hacluster/tmp/test*' overwrite into table testpartition;
Cause Analysis
The load data command creates a temporary table in memory and sets the location of that table to the path of the file to be imported. The load data statement is then rewritten as insert overwrite Target table select * from Temporary table. Because the target file path contains a wildcard, the specific files cannot be located, so the original files in the partitioned table are not overwritten.
Procedure
To import data to a dynamic partitioned table, do not use wildcard characters to specify files. Instead, place the files to be imported in a directory and load the entire directory. The following is an example:
load data inpath '/tmp/testdir/' overwrite into table testpartition;
How Do I Optimize the INSERT OVERWRITE for Reading and Writing in Same Table?
Scenario
If data needs to be inserted into the destination table using dynamic partitioning (updating historical partitions) and the destination table is the same as the data source table, running INSERT OVERWRITE on the source table may cause data loss or data inconsistency. To avoid this problem, you are advised to use a temporary table to process the data and then perform the INSERT OVERWRITE.
Procedure
The following table is taken as an example:
user_data(user_group int, user_name string, update_time timestamp);
In this table, user_group is the partitioning column. You need to sort the existing data by update time and update the user group information. To do so, perform the following steps:
- On the Hive Beeline CLI, enable Hive dynamic partitioning.
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
- Create a temporary table for storing deduplicated data.
CREATE TABLE temp_user_data AS
SELECT * FROM (
SELECT *,
ROW_NUMBER() OVER(PARTITION BY user_group ORDER BY update_time DESC) as rank
FROM user_data
) tmp
WHERE rank = 1;
- Use the temporary table as the data source and insert data into the destination table.
INSERT OVERWRITE TABLE user_data
SELECT user_group, user_name, update_time
FROM temp_user_data;
- Clear the temporary table.
DROP TABLE IF EXISTS temp_user_data;