FusionInsight HD 6.5.0 Product Description 02

CarbonData

CarbonData is an Apache Hadoop native file format for fast interactive queries. It uses advanced columnar storage, indexing, compression, and encoding techniques to improve computing efficiency, which can speed up queries over petabytes of data by an order of magnitude. CarbonData is also a high-performance analysis engine that integrates with Spark as a data source.

Figure 4-34 CarbonData Architecture

The goal of CarbonData is to provide ultra-fast responses to ad hoc queries on big data. CarbonData is essentially an OLAP engine that stores data in tables similar to those in an RDBMS. Users can bulk load large amounts of data (10 TB or more) into CarbonData, which automatically organizes and stores it in a compressed, multi-dimensional indexed columnar format. Once the data is loaded, ad hoc queries can be run against it, and CarbonData returns the results in seconds.

CarbonData integrates into the Spark ecosystem as a data source, so you can query and analyze the data using Spark SQL. Third-party tools can connect to Spark SQL through the ThriftServer provided with Spark.

CarbonData Cluster Topology

CarbonData runs as a data source inside Spark, so it does not start any additional processes on the cluster nodes. The CarbonData engine runs inside the Spark executor process itself.

Figure 4-35 CarbonData Cluster

The data stored in CarbonData tables is divided into several CarbonData data files. For each incoming query, the CarbonData engine module performs the actual work of reading the tables, filtering the data, and so on. The engine runs as part of the Spark executor process, and each instance is responsible for handling a subset of the data file blocks.

The table data is stored in HDFS. The nodes of the Spark cluster can also serve as HDFS DataNodes.
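
The division of work described above can be pictured with a small sketch. It is only an illustration, not CarbonData or Spark code: the function names, the block layout, and the round-robin assignment are all assumptions made for this example (a real scheduler also takes data locality into account).

```python
from collections import defaultdict


def assign_blocks(blocks, executors):
    """Assign data file blocks to executors round-robin (illustrative assumption)."""
    assignment = defaultdict(list)
    for i, block in enumerate(blocks):
        assignment[executors[i % len(executors)]].append(block)
    return dict(assignment)


def scan_block(block, predicate):
    """Work done by the engine inside one executor: read a block, filter its rows."""
    return [row for row in block["rows"] if predicate(row)]


def run_query(blocks, executors, predicate):
    """Each executor scans only its own subset of blocks; the results are merged."""
    result = []
    for subset in assign_blocks(blocks, executors).values():
        for block in subset:
            result.extend(scan_block(block, predicate))
    return result


# Hypothetical table split into three data file blocks.
blocks = [
    {"name": "part-0", "rows": [1, 5, 9]},
    {"name": "part-1", "rows": [2, 6, 10]},
    {"name": "part-2", "rows": [3, 7, 11]},
]
executors = ["executor-1", "executor-2"]

assignment = assign_blocks(blocks, executors)
matches = run_query(blocks, executors, lambda value: value > 6)
```

No extra process appears anywhere in this picture: the scan logic is simply a function called from within each executor, which is the point the topology description makes.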

CarbonData Features

  • SQL Capability: CarbonData is fully compliant with Spark SQL and supports all SQL queries that can run directly on Spark SQL.
  • Easy Table Definition: CarbonData supports easy-to-use DDL (Data Definition Language) statements for defining and creating tables. CarbonData DDL is flexible and simple to use, yet powerful enough to define complex tables.
  • Easy Data Management: CarbonData supports a variety of data management functions for loading data into tables and maintaining it. It supports bulk loading of historical data as well as incremental loading of new data. Loaded data can be deleted based on load time, or a specific load can be undone.
  • The CarbonData file format is a columnar store in HDFS. It has many features of a modern columnar format, such as splittable files and compression schemes. In addition, CarbonData has the following unique features:
    • Stores data along with an index: The index can significantly accelerate query performance and reduce I/O scans and CPU usage when the query contains filters. The CarbonData index consists of multiple levels of indices. A processing framework can use the index to reduce the number of tasks it needs to schedule and process, and during task-side scanning it can skip-scan at a finer-grained unit (called a blocklet) instead of scanning the whole file.
    • Operable encoded data: By supporting efficient compression and global encoding schemes, CarbonData can run queries directly on compressed or encoded data. The data is decoded only just before the results are returned to the user, which is known as late materialization.
    • Support for various use cases with a single data format: for example, interactive OLAP-style queries, sequential access (big scans), and random access (narrow scans).
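
The index and encoding features in the list above can be sketched together in a few lines. This is a simplified model under stated assumptions, not the CarbonData file layout: the function names are hypothetical, the dictionary is a plain sorted mapping, each blocklet keeps only a min/max of its codes, and the sample column is already sorted so that blocklet ranges do not overlap.

```python
def build_dictionary(values):
    """Global dictionary: sorted distinct values mapped to integer codes."""
    distinct = sorted(set(values))
    return {value: code for code, value in enumerate(distinct)}, distinct


def build_blocklets(codes, size):
    """Split an encoded column into blocklets and record min/max per blocklet."""
    blocklets = []
    for start in range(0, len(codes), size):
        chunk = codes[start:start + size]
        blocklets.append({"min": min(chunk), "max": max(chunk), "codes": chunk})
    return blocklets


def filter_equals(blocklets, encode, decode, value):
    """Evaluate an equality filter on codes; decode only the matching rows."""
    code = encode.get(value)
    if code is None:
        return [], 0  # value not in the dictionary, nothing to scan
    scanned = 0
    hits = []
    for blocklet in blocklets:
        if code < blocklet["min"] or code > blocklet["max"]:
            continue  # index says this blocklet cannot match: skip it
        scanned += 1
        hits.extend(c for c in blocklet["codes"] if c == code)
    # Late materialization: codes become values only for the final result.
    return [decode[c] for c in hits], scanned


# Hypothetical column, sorted so that min/max ranges are selective.
cities = ["Berlin", "Berlin", "Cairo", "Cairo", "Lima", "Lima", "Oslo", "Oslo"]
encode, decode = build_dictionary(cities)
blocklets = build_blocklets([encode[c] for c in cities], size=2)

rows, scanned = filter_equals(blocklets, encode, decode, "Lima")
```

In this toy run the filter touches one of the four blocklets and compares integers rather than strings, which is the effect the two features aim for; real savings depend on how the data is sorted and distributed.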

CarbonData Key Technologies and Advantages

  • Fast Query Response: High-performance query execution is one of CarbonData's key technical advantages. CarbonData queries run approximately 10 times faster than Spark SQL queries. CarbonData uses a specialized file format designed from the ground up with query performance in mind. It combines multiple indexing techniques with dictionary encoding and several pushdown optimizations to deliver the best possible response time for queries on TB-level data.
  • High Data Compression: CarbonData uses a combination of lightweight and heavyweight compression algorithms to compress data. It can reduce the storage space required by 60% to 80%, which significantly lowers storage hardware costs.
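
The idea of layering a lightweight scheme under a heavyweight one can be illustrated as follows. The source does not name the algorithms CarbonData uses, so this sketch makes its own choices: run-length encoding stands in for the lightweight step and zlib for the heavyweight step, and all function names are hypothetical. The saving it computes applies to this toy column only and says nothing about the 60% to 80% figure above.

```python
import json
import zlib


def rle_encode(values):
    """Lightweight step: collapse consecutive repeats into [value, count] runs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return runs


def rle_decode(runs):
    """Expand [value, count] runs back into the original sequence."""
    out = []
    for value, count in runs:
        out.extend([value] * count)
    return out


def compress_column(values):
    """Apply RLE first, then a general-purpose codec (zlib, chosen for illustration)."""
    payload = json.dumps(rle_encode(values)).encode("utf-8")
    return zlib.compress(payload)


def decompress_column(blob):
    """Reverse both steps to recover the column."""
    return rle_decode(json.loads(zlib.decompress(blob).decode("utf-8")))


# Hypothetical low-cardinality column with long runs.
column = ["DE"] * 500 + ["FR"] * 300 + ["NO"] * 200
raw_size = len(json.dumps(column).encode("utf-8"))
blob = compress_column(column)
saving = 1 - len(blob) / raw_size  # fraction of space saved for this toy column
```

Columnar layouts tend to suit this kind of layering because values of one column sit next to each other, so runs and repeated patterns are common.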
Updated: 2019-05-17

Document ID: EDOC1100074548
