FusionInsight HD V100R002C60SPC200 Product Description 06

Rate and give feedback:
Huawei uses machine translation combined with human proofreading to translate this document to different languages in order to help you better understand the content of this document. Note: Even the most advanced machine translation cannot match the quality of professional translators. Huawei shall not bear any responsibility for translation accuracy and it is recommended that you refer to the English document (a link for which has been provided).
Spark SQL Enhanced

Spark SQL Capability

Spark SQL capabilities are enhanced mainly in terms of Hive syntax compatibility and standard SQL syntax compatibility.

Hive Syntax Compatibility

Spark SQL provides a HiveContext that supports some Hive syntax:

Figure 4-18 Basic Architecture of HiveContext
  • HiveContext uses the Hive parser. A converter implemented by Spark SQL then transforms the abstract syntax tree (AST) produced by the Hive parser into a Catalyst logical plan.
  • HiveContext inherits from SQLContext, so its logical plans are processed by Catalyst; only the SQL parsing step uses the Hive parser.
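For illustration, here is a minimal usage sketch based on the open-source Spark 1.x API (the application name and the sales table are hypothetical):

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.sql.hive.HiveContext

  object HiveContextDemo {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("HiveContextDemo"))
      // HiveContext extends SQLContext; SQL submitted here is parsed by the
      // Hive parser and then converted into a Catalyst logical plan.
      val hiveContext = new HiveContext(sc)

      // "sales" is a hypothetical Hive table, used only for illustration.
      hiveContext.sql("SELECT region, SUM(amount) FROM sales GROUP BY region").show()

      sc.stop()
    }
  }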

Hive syntax compatibility is tested by running the 64 SQL statements in the Hive-Test-benchmark test set. Compared with Apache Spark, support for the following features is added:

Table 4-1 Features added for Hive syntax compatibility

  • Sub-queries embedded in the IN operator: The IN operator selects records whose field value belongs to a given set. Apache Spark does not support obtaining this set through a sub-query; this enhancement adds that capability.
  • The OVER keyword: Creates a window; it must be used together with a sort or aggregate function.
  • The WITH keyword: Defines an SQL fragment. It acts as a temporary variable that assigns an alias to an SQL statement, improving readability.
  • Sub-queries in an EXISTS clause: An EXISTS clause can use a sub-query as the judgment condition in a WHERE statement.
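The following sketch exercises these constructs together through the HiveContext created above (the orders and customers tables and their columns are hypothetical):

  // Hypothetical tables: orders(order_id, cust_id, amount) and customers(cust_id, city).
  hiveContext.sql("""
    WITH big_orders AS (                                -- WITH: name an SQL fragment
      SELECT * FROM orders WHERE amount > 1000
    )
    SELECT b.cust_id,
           b.amount,
           RANK() OVER (ORDER BY b.amount DESC) AS rk   -- OVER: windowing
    FROM big_orders b
    WHERE b.cust_id IN (SELECT cust_id FROM customers   -- IN with a sub-query
                        WHERE city = 'Shenzhen')
      AND EXISTS (SELECT 1 FROM customers c             -- EXISTS with a sub-query
                  WHERE c.cust_id = b.cust_id)
  """).show()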

Standard SQL Syntax Compatibility

Figure 4-19 Standard SQL statement execution process

Standard SQL compatibility is enhanced according to the principle of not modifying the Catalyst and SQLContext core module code. Instead, the enhancements are plugged into HiveContext, as follows:

  • SQL parser: A pluggable SQL parser interface is implemented, and a parser for the SQL99 syntax is provided through this interface (a sketch follows this list).
  • Logical plan: An SQL99LogicalPlans class is added. All new logical plans are managed by this class; the original logical plans are neither modified nor extended.
  • Expressions: An SQL99Expressions file is added. All new expressions are added to this file for unified management; the original expressions are neither modified nor extended.
  • Optimizer: In principle, a syntax enhancement solution does not involve optimizer changes. However, if an enhancement requires new optimization rules, they are added to a new SQL99OptimizerRules file.
  • Physical plan: An SQL99SparkPlans file is added. All new physical plans are added to this file for unified management; the original physical plans are neither modified nor extended.
  • Metadata management: The Hive metastore is used to manage metadata.
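As a rough illustration of this pluggable design, a parser interface might look like the following sketch. The trait and method names here are hypothetical, not the actual FusionInsight internals; only the class names cited in the list above come from the source:

  import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

  // Hypothetical sketch: a pluggable parser interface lets an SQL99 parser
  // coexist with the Hive parser without touching the SQLContext core code.
  trait PluggableSqlParser {
    // Parse SQL text into a Catalyst logical plan.
    def parse(sqlText: String): LogicalPlan
  }

  class Sql99Parser extends PluggableSqlParser {
    override def parse(sqlText: String): LogicalPlan = {
      // An SQL99-compliant parser would build the new logical plans here
      // (e.g. those managed by SQL99LogicalPlans), leaving the original
      // plans untouched.
      ???
    }
  }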

Standard SQL syntax compatibility is tested by running the 99 SQL statements in the TPC-DS test set. Compared with Apache Spark, support for the following features is added:

Table 4-2 Features added for standard SQL syntax compatibility

  • The TOP keyword: Similar to the LIMIT keyword; returns the first n records of a query result.
  • The WITH keyword: Defines an SQL fragment. It acts as a temporary variable that assigns an alias to an SQL statement, improving readability.
  • The rank over function: Sorts user-specified fields within the partitions specified by the OVER keyword.
  • String constants following the AS keyword: Uses the content of a character string as the alias of a field.
  • Date conversion in the CAST function: Converts data of other types into the date type.
  • A sorting keyword for individual fields in an ORDER BY clause: In the original SQLContext, a sorting keyword (desc or asc) must either be added for every field in the ORDER BY clause or omitted for all of them; this enhancement allows a keyword to be added for only some fields.
  • The ROLLUP operator: In a GROUP BY ROLLUP(A, B, C) clause, the GROUP BY operation is performed first for (A, B, C), then for (A, B), then for (A), and finally for the entire table. That is, ROLLUP enables a SELECT statement to calculate multiple levels of subtotals across a specified group of dimensions.
  • The GROUPING operator: Indicates whether a given expression in a GROUP BY clause is aggregated. If it is, the value 1 is returned; otherwise, 0 is returned.
  • Aggregate function expressions in an ORDER BY clause: Supports sorting by the computed result of an aggregate function in the ORDER BY clause.
  • EXISTS and NOT EXISTS clauses: An EXISTS clause can use a sub-query as the judgment condition in a WHERE statement.
  • Sub-queries in IN clauses: Checks whether a given field or expression exists in a sub-query result.
  • The OVER keyword: Creates a window; it must be used together with a sort or aggregate function.
  • COALESCE: Returns the first non-null value in the parameter list.
  • Complicated sub-queries following a FROM clause: In Apache Spark, support for sub-queries following a FROM clause is incomplete because an error occurs when the FROM clause contains two consecutive parentheses; this enhancement completes that support.
  • stddev_samp: Calculates the sample standard deviation.
  • Sub-queries in a WHERE clause: Compares a field with a sub-query result.
  • dec: Converts a binary string into a numeric string.
  • Sub-queries in CASE WHEN: Compares a sub-query result with other values.
  • The || operator: Concatenates strings.
  • numeric: Converts a binary string into a numeric string.
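The following sketch combines several of these enhancements in one statement (the sales table and its columns are hypothetical, and the exact syntax accepted may vary by version):

  // Hypothetical table: sales(region, product, amount, sold_on).
  hiveContext.sql("""
    SELECT COALESCE(region, 'ALL')  AS region_label,   -- COALESCE: first non-null value
           COALESCE(product, 'ALL') AS product_label,
           GROUPING(region)         AS region_rollup,  -- GROUPING: 1 if aggregated, else 0
           SUM(amount)              AS total
    FROM sales
    WHERE CAST(sold_on AS date) >= CAST('2015-01-01' AS date)  -- CAST to the date type
    GROUP BY ROLLUP(region, product)                   -- multi-level subtotals
    ORDER BY SUM(amount) DESC                          -- aggregate expression in ORDER BY
  """).show()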

Optimizing Data Skewness

In Spark SQL multi-table join scenarios, severe association-key skew may occur. As a result, after data is distributed by hash, some buckets contain far more data than others: the corresponding heavy tasks run slowly while the light tasks finish quickly. The heavy tasks hinder overall computing performance, and the light tasks leave CPUs idle, wasting CPU resources.

To address this scenario, the original join is split into a common join, which processes the non-skewed association keys, and a map-side join (broadcast join), which processes the skewed keys, as shown in Figure 4-20.

Figure 4-20 Converting Skewed Data Join

After the conversion, data that does not contain skewed keys is still evenly distributed to tasks for processing. For data containing skewed keys, the smaller side is broadcast, and a map-side join evenly distributes the data to tasks, fully utilizing CPU resources and improving overall performance.
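Conceptually, the rewrite resembles the following manual split. The fact and dim tables and the skewed key value 'hot_key' are hypothetical, the enhancement performs this split automatically, and whether the MAPJOIN hint is honored depends on the Spark version:

  // Non-skewed rows keep the ordinary shuffle (common) join; rows with the
  // skewed key go through a map-side (broadcast) join, sketched here with a
  // Hive-style MAPJOIN hint standing in for the broadcast.
  hiveContext.sql("""
    SELECT f.id, f.v, d.name
    FROM fact f JOIN dim d ON f.id = d.id
    WHERE f.id <> 'hot_key'
    UNION ALL
    SELECT /*+ MAPJOIN(d) */ f.id, f.v, d.name
    FROM fact f JOIN dim d ON f.id = d.id
    WHERE f.id = 'hot_key'
  """).show()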

Optimizing Small Files

Spark SQL tables may contain massive numbers of small files (each far smaller than an HDFS block). Every small file corresponds to one partition in Spark, that is, one task, so Spark ends up with a large number of tasks to process. If the SQL logic involves a shuffle, the number of hash buckets also increases dramatically, severely hindering performance.

To address small files, the coalesce algorithm is adopted to merge the partitions generated for the small files in a table. This reduces the number of partitions, and hence the number of hash buckets in case of a shuffle, improving performance, as shown in Figure 4-21.

Figure 4-21 Merging small files
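At the DataFrame level, the same idea can be sketched with the open-source coalesce API (the table name and target partition count are hypothetical):

  // "many_small_files" is a hypothetical table backed by numerous small files,
  // each of which initially maps to one partition (one task).
  val df = hiveContext.table("many_small_files")

  // coalesce merges partitions without a full shuffle, reducing the number of
  // tasks and, when a shuffle follows, the number of hash buckets.
  val merged = df.coalesce(16) // 16 is an illustrative target partition count
  merged.registerTempTable("merged_view")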

Permission Management

In a secure cluster, Spark SQL provides a permission system to manage and control users' operations on databases, ensuring that different users can operate databases independently and securely. A user can operate another user's tables and databases only with the corresponding permissions; otherwise, the operations are rejected.

  • For existing Hive data, Spark SQL permissions follow the mechanisms and metadata of Hive. With the proper permissions, users can continue to access Hive data through Spark SQL.
  • The Spark SQL permission model consists of metadata permissions and HDFS file permissions. To run SQL statements in Spark SQL, a user must be granted the permissions on the relevant Spark SQL databases and tables (including external tables and views).
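For example, with Hive-style SQL authorization, granting another user read access might look like the following sketch. The database, table, and user names are hypothetical, and in FusionInsight permissions may also be assigned through the cluster's management interface:

  // user_a_db.orders is owned by another user; user_b is being granted access.
  // Without such a grant, user_b's queries against the table are rejected.
  hiveContext.sql("GRANT SELECT ON TABLE user_a_db.orders TO USER user_b")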