FusionInsight HD 6.5.0 Product Description 02

CarbonData Overview

CarbonData is a new Apache Hadoop native file format designed for fast interactive queries. It uses advanced column-based storage, indexing, compression, and encoding technologies to improve computing efficiency, and it can speed up queries on petabytes of data or more. CarbonData is also a high-performance analysis engine that integrates a data source with Spark.

Figure 4-36 Architecture of CarbonData

The goal of CarbonData is to provide ultra-fast responses to ad-hoc queries on big data. CarbonData is essentially an OLAP engine that stores data in tables similar to those in an RDBMS. Users can load more than 10 TB of data into CarbonData, which automatically organizes and stores it in a compressed, multi-dimensional, and indexed columnar format. Once data is loaded, ad-hoc queries can be run against it and CarbonData can respond in seconds.

CarbonData integrates a data source into the Spark ecosystem, so you can query and analyze the data using Spark SQL. Third-party tools can connect to Spark SQL through the JDBCServer provided with Spark.

Topology of CarbonData Clusters

CarbonData runs as a data source inside Spark. Therefore, CarbonData does not start any additional processes on the cluster nodes. The CarbonData engine runs inside the Spark executor.

Figure 4-37 Topology of CarbonData clusters

Data stored in CarbonData tables is divided into several CarbonData data files. The CarbonData engine module executes tasks such as reading tables and filtering data in response to queries. Each CarbonData engine instance runs as part of a Spark executor and is responsible for handling a subset of the data blocks.
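The idea that each executor handles only a subset of data blocks can be sketched as follows. This is a simplified illustration, not the actual Spark or CarbonData scheduling logic: the round-robin policy, the `assign_blocks` function, and the file and executor names are all assumptions made for the example. A real cluster would also take data locality into account.

```python
from typing import Dict, List


def assign_blocks(blocks: List[str], executors: List[str]) -> Dict[str, List[str]]:
    """Spread data blocks over executors in round-robin order (assumed policy)."""
    assignment: Dict[str, List[str]] = {executor: [] for executor in executors}
    for index, block in enumerate(blocks):
        # Each block goes to exactly one executor.
        executor = executors[index % len(executors)]
        assignment[executor].append(block)
    return assignment


# Hypothetical data file and executor names, for illustration only.
blocks = [f"part-{i}.carbondata" for i in range(5)]
assignment = assign_blocks(blocks, ["executor-1", "executor-2"])
for executor, owned in assignment.items():
    print(executor, owned)
```

Because every block is owned by a single executor, the executors can read and filter their blocks in parallel without coordinating with each other.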

The table data is stored in HDFS. The Spark cluster nodes can be used as HDFS DataNodes.

CarbonData Features

  • SQL capability: CarbonData is compatible with Spark SQL and supports SQL queries run on Spark SQL.
  • Easy table definition: CarbonData provides easy-to-use DDL (Data Definition Language) statements for defining and creating tables. CarbonData DDL is highly flexible and allows complex tables to be defined.
  • Easy data management: CarbonData supports a variety of data management functions, including loading data to tables and maintaining data in tables. CarbonData supports bulk loading of historical data and incremental loading of new data. Loaded data can be deleted based on load time and a specific loading operation can be undone.
  • CarbonData files are stored in column-based format in HDFS. The column-based storage format supports table splitting and compression. Its main characteristics are as follows:
    • Stores data with indexes: CarbonData indexes can significantly accelerate queries and reduce I/O scans and CPU resource consumption when a query contains filters. The indexes have multiple levels, which lets the processing framework use them to reduce the number of tasks to be scheduled. The scanning workload is further reduced by scanning a finer-grained unit (called a blocklet) instead of the entire file.
    • Operable encoded data: The efficient compression and global encoding schemes of CarbonData enable queries to run directly on compressed or encoded data. Data is decoded only just before it is returned to users, a process called late materialization.
    • Supports applying one data format for various scenarios, for example, interactive OLAP-style query, sequential access (big scan), and random access (narrow scan).
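The effect of blocklet-level indexes on a filtered query can be illustrated with a minimal sketch. It assumes that each blocklet keeps the minimum and maximum of a column as its index; the `Blocklet` class, `make_blocklet`, and `scan_equal` are hypothetical names for this example and do not reflect the actual CarbonData file layout or API.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Blocklet:
    """Hypothetical blocklet: column values plus a precomputed min/max index."""
    values: List[int]
    min_value: int
    max_value: int


def make_blocklet(values: List[int]) -> Blocklet:
    # The index is computed once, at load time, not at query time.
    return Blocklet(values, min(values), max(values))


def scan_equal(blocklets: List[Blocklet], target: int) -> Tuple[List[int], int]:
    """Return the matching values and the number of blocklets actually scanned."""
    matches: List[int] = []
    scanned = 0
    for blocklet in blocklets:
        # Index check: skip the blocklet if the target cannot fall in its range.
        if target < blocklet.min_value or target > blocklet.max_value:
            continue
        scanned += 1
        matches.extend(v for v in blocklet.values if v == target)
    return matches, scanned


blocklets = [make_blocklet([1, 3, 5]), make_blocklet([10, 12, 15]), make_blocklet([20, 25, 30])]
matches, scanned = scan_equal(blocklets, 12)
print(matches, scanned)  # [12] 1
```

In this example only one of the three blocklets is read, which is the kind of I/O and CPU saving that the index description above refers to.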

Key Technologies and Advantages

  • Quick query response: High-performance query execution is one of the key technical advantages of CarbonData. CarbonData queries are approximately 10 times faster than those of Spark SQL. CarbonData uses a specialized file format designed for high query performance, and combines multiple indexing techniques with dictionary encoding and several pushdown optimizations to deliver the best possible response time for queries on TB-level data.
  • High data compression: CarbonData uses both lightweight and heavyweight compression algorithms to compress data. CarbonData can reduce the space required for storing data by 60% to 80%, significantly lowering storage costs.
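Dictionary encoding combined with filtering on encoded values can be sketched as below. This is an illustrative model only: the `dictionary_encode` and `filter_then_decode` functions and the sample city column are assumptions for the example, and the real CarbonData encoders are more elaborate.

```python
from typing import Dict, List, Tuple


def dictionary_encode(column: List[str]) -> Tuple[Dict[str, int], List[int]]:
    """Map each distinct value to a small integer code, assigned in sorted order."""
    dictionary = {value: code for code, value in enumerate(sorted(set(column)))}
    return dictionary, [dictionary[value] for value in column]


def filter_then_decode(column: List[str], wanted: str) -> List[Tuple[int, str]]:
    """Filter on integer codes, then decode only the rows that matched."""
    dictionary, encoded = dictionary_encode(column)
    code = dictionary.get(wanted)
    if code is None:
        return []  # The value is not in the dictionary, so no row can match.
    # The filter compares integer codes only; no strings are touched here.
    hit_rows = [row for row, c in enumerate(encoded) if c == code]
    # Decoding happens last, and only for the rows that survived the filter.
    reverse = {c: v for v, c in dictionary.items()}
    return [(row, reverse[encoded[row]]) for row in hit_rows]


cities = ["Shenzhen", "Beijing", "Shenzhen", "Hangzhou", "Beijing"]
result = filter_then_decode(cities, "Beijing")
print(result)  # [(1, 'Beijing'), (4, 'Beijing')]
```

Storing repeated strings as small integers also reduces the stored size of the column, which is one reason dictionary encoding is commonly paired with compression.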
Updated: 2019-05-17

Document ID: EDOC1100074548
