FusionInsight HD 6.5.0 Product Description 02

Elasticsearch

Basic Principle

Structure

The Elasticsearch cluster solution consists of the EsMaster, EsClient, and EsNode1 through EsNode9 processes, as shown in Figure 2-3. Table 2-3 describes these modules.

Figure 2-3 Elasticsearch structure
Table 2-3 Module description

  • Client: Communicates with the EsClient and EsNode instance processes in the Elasticsearch cluster over HTTP or HTTPS to perform distributed data collection and search.
  • EsMaster: The master node of Elasticsearch. It manages the cluster, for example, by determining shard allocation and tracking cluster nodes.
  • EsNode1-9: The data nodes of Elasticsearch. They store index data and perform the add, delete, modify, query, and aggregation operations on documents.
  • EsClient: The coordinator node of Elasticsearch. It routes requests, distributes search operations, and dispatches index operations. EsClient does not store data or manage the cluster.
  • ZooKeeper cluster: Provides the heartbeat mechanism for processes in the Elasticsearch cluster.

Basic Concepts
  • Index: An index is a logical namespace in Elasticsearch that consists of one or multiple shards. Apache Lucene is used to read and write data in the index. An index is similar to a database in a relational database (RDB) instance. One Elasticsearch instance can contain multiple indexes.
  • Type: If documents of various structures are stored in an index, the document type is used to find the corresponding mapping information, facilitating document storage. A type is similar to a table in a database. One index corresponds to one document type.
  • Document: A document is the basic unit of information that can be indexed. It refers to JSON data at the top-level structure or obtained by serializing the root object. A document is similar to a row in a database. A type contains multiple documents.
  • Mapping: A mapping restricts the type of a field and can be automatically created based on data. A mapping is similar to a schema in a database (see the sketch after this list).
  • Field: A field is the minimum unit of a document. A field is similar to a column in a database. Each document contains multiple fields.
  • EsMaster: The master node, which manages cluster-level changes, such as creating or deleting indexes and adding or removing nodes. The master node does not participate in document-level changes or searches, so it does not become a bottleneck when traffic increases.
  • EsNode: An Elasticsearch node. A node is an Elasticsearch instance.
  • EsClient: The coordinator node of Elasticsearch. It routes requests, distributes search operations, and dispatches index operations. EsClient does not store data or manage the cluster.
  • Shard: A shard is the smallest work unit in Elasticsearch. Documents are stored and referenced in shards.
  • Primary Shard: Each document in an index belongs to one primary shard. The number of primary shards determines the maximum amount of data that can be stored in the index.
  • Replica Shard: A replica shard is a copy of a primary shard. It prevents data loss caused by hardware faults and serves read requests, such as searching for or retrieving documents.
  • Recovery: Data restoration or data redistribution. When a node is added or deleted, Elasticsearch redistributes index shards based on the load of the corresponding physical servers. Data restoration is also performed when a faulty node restarts.
  • Gateway: The storage mode of Elasticsearch index snapshots. By default, Elasticsearch holds an index in memory and persists it to the local hard disk when the memory is full. A gateway stores index snapshots; when an Elasticsearch cluster is stopped and then restarted, the index backup data is read from the gateway. Elasticsearch supports multiple types of gateways, including local file systems (default), distributed file systems, Hadoop HDFS, and Amazon S3 cloud storage.
  • Transport: The interaction mode between Elasticsearch internal nodes or clusters and the Elasticsearch client. By default, TCP is used. The HTTP (JSON format), Thrift, Servlet, Memcached, and ZeroMQ transmission protocols are also supported (integrated through plug-ins).
  • ZooKeeper Cluster: Mandatory in Elasticsearch. It provides functions such as storage of security authentication information.
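These concepts map directly onto the REST interface. The following minimal sketch uses Python with the requests library against an ES 6.x cluster; the host, index, type, and field names are placeholders for illustration only, not part of the product.

import requests

ES = "http://es-client:9200"  # placeholder EsClient HTTP endpoint

# Create an index named "library" with 3 primary shards and 1 replica per
# primary, plus a mapping (schema) for a single document type "book".
requests.put(ES + "/library", json={
    "settings": {"number_of_shards": 3, "number_of_replicas": 1},
    "mappings": {
        "book": {                      # the document type
            "properties": {            # field-level mapping
                "title": {"type": "text"},
                "pages": {"type": "integer"}
            }
        }
    }
})

# Index a document: a JSON object whose fields conform to the mapping.
requests.put(ES + "/library/book/1",
             json={"title": "Elasticsearch Basics", "pages": 312})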
Working Principles
  • Internal architecture

    Elasticsearch provides various access interfaces through RESTful APIs or language clients (such as Java), uses a cluster discovery mechanism, and supports scripting languages and various plug-ins. The underlying layer is based on Lucene and keeps Lucene completely independent; indexes are stored in local files, shared files, or HDFS, as shown in Figure 2-4.

    Figure 2-4 Internal architecture
  • Inverted indexing

    In the traditional search mode, as shown in Figure 2-5, documents are looked up by their IDs, and the keywords of each document are scanned during a search to find all documents containing a given keyword. Such forward indexing is easy to maintain but time-consuming to search.

    Figure 2-5 Forward indexing

    For Elasticsearch (Lucene), as shown in Figure 2-6, there is a dictionary composed of keywords and their statistical information, such as IDs, positions, and document frequencies. In this search mode, Elasticsearch looks up the keyword to obtain the document IDs and positions and then finds the documents, much like looking up a word in a dictionary or using a table of contents to find the content on a specific page. Inverted indexing is efficient for searching but time-consuming for index construction and costly to maintain (a toy example follows Figure 2-6).

    Figure 2-6 Inverted indexing
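    The dictionary-plus-postings idea behind Figure 2-6 can be illustrated with a toy Python sketch; this version records only document IDs and word positions, not the frequency statistics that Lucene also maintains.

    from collections import defaultdict

    def build_inverted_index(docs):
        """Map each keyword to the (doc_id, position) pairs where it occurs."""
        index = defaultdict(list)
        for doc_id, text in docs.items():
            for position, word in enumerate(text.lower().split()):
                index[word].append((doc_id, position))
        return index

    docs = {1: "the quick brown fox", 2: "the lazy brown dog"}
    index = build_inverted_index(docs)
    # Searching is a single dictionary lookup instead of scanning every document:
    print(index["brown"])  # [(1, 2), (2, 2)]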
  • Distributed indexing flow

    Figure 2-7 shows the Elasticsearch distributed indexing flow.

    Figure 2-7 Distributed indexing flow

    The process is as follows:

    Phase 1: The client sends an index request to any node, for example, Node 1.

    Phase 2: Node 1 determines the shard that stores the document based on the request. Assume that the shard is shard 0. Node 1 then forwards the request to Node 3, where primary shard P0 of shard 0 resides.

    Phase 3: Node 3 executes the request on primary shard P0 of shard 0. If the request is executed successfully, Node 3 forwards the request to the replica shards R0 on Node 1 and Node 2 concurrently. After each replica shard executes the request successfully, it returns an acknowledgment to Node 3. Once Node 3 has received acknowledgments from all the replica shards, it returns a success message to the user (a client-side sketch follows).
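    From the client's perspective, the three phases are a single request that can be sent to any node. A minimal sketch follows (the host name is a placeholder); the _shards section of the response reports how many shard copies acknowledged the write.

    import requests

    # Phase 1: send the indexing request to any node in the cluster.
    resp = requests.put("http://any-node:9200/library/book/42",
                        json={"title": "Distributed Indexing", "pages": 100})

    # Phases 2 and 3 happen inside the cluster; the response reports the outcome.
    print(resp.json().get("_shards"))  # e.g. {"total": 2, "successful": 2, "failed": 0}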

  • Distributed searching flow

    The Elasticsearch distributed searching flow consists of two phases: query and acquisition.

    Figure 2-8 shows the query phase.

    Figure 2-8 Query phase of the distributed searching flow

    The process is as follows:

    Phase 1: The client sends a retrieval request to any node, for example, Node 3.

    Phase 2: Node 3 forwards the retrieval request to each shard of the index using a round-robin policy: for every shard, one copy (the primary shard or one of its replica shards) is selected in turn, which balances the read request load. Each shard performs the retrieval locally and builds a sorted local result set.

    Phase 3: Each shard returns its local result to Node 3, which merges the results and performs global sorting.

    In the query phase, the data to be retrieved is located. In the acquisition phase, this data is collected and returned to the client. Figure 2-9 shows the acquisition phase.

    Figure 2-9 Acquisition phase of the distributed searching flow

    The process is as follows:

    Phase 1: After all the data to be retrieved has been located, Node 3 sends requests to the related shards.

    Phase 2: Each shard that receives a request from Node 3 reads the related documents and returns them to Node 3.

    Phase 3: After obtaining all the documents returned by the shards, Node 3 combines them into a summary result and returns it to the client (see the search sketch below).
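    The two phases explain why deep pagination is expensive: in the query phase every shard ranks its local top (from + size) hits, and only the winning documents are fetched in the acquisition phase. A request sketch with a placeholder host:

    import requests

    resp = requests.post("http://any-node:9200/library/_search", json={
        "query": {"match": {"title": "elasticsearch"}},
        "from": 0,   # query phase: each shard ranks its top (from + size) hits locally
        "size": 10   # acquisition phase: only these winners are actually fetched
    })
    for hit in resp.json()["hits"]["hits"]:
        print(hit["_id"], hit["_score"])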

  • Distributed bulk indexing flow

    The process is as follows:

    Phase 1: The client sends a bulk request to Node 1.

    Phase 2: Node 1 constructs a bulk request for each shard and forwards each sub-request to the node holding the corresponding primary shard.

    Phase 3: The primary shard executes the requests one by one. After each operation is complete, it forwards the new document (or the deletion) to the corresponding replica shards and then moves on to the next operation. Once the replica shards report that all operations are complete, the request node sorts the responses and returns them to the client (see the _bulk sketch below).
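    A bulk request body is newline-delimited JSON: each action line is followed, where applicable, by a source line. A sketch against the standard _bulk endpoint (host, index, and IDs are placeholders):

    import requests

    bulk_body = (
        '{"index": {"_index": "library", "_type": "book", "_id": "1"}}\n'
        '{"title": "Book One", "pages": 100}\n'
        '{"delete": {"_index": "library", "_type": "book", "_id": "2"}}\n'
    )
    resp = requests.post("http://any-node:9200/_bulk", data=bulk_body,
                         headers={"Content-Type": "application/x-ndjson"})
    # "errors" is False only if every per-shard operation succeeded.
    print(resp.json()["errors"])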

  • Distributed bulk searching flow

    The process is as follows:

    Phase 1: The client sends a mget request to Node 1.

    Phase 2: Node 1 constructs a multi-document retrieval request for each shard and forwards each request to the primary shard or one of its replica shards. When all replies have been received, Node 1 constructs the response and returns it to the client (see the _mget sketch below).
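    A sketch of such an mget request (placeholder host, index, and IDs); one round trip retrieves documents that may reside on different shards:

    import requests

    resp = requests.post("http://any-node:9200/_mget", json={
        "docs": [
            {"_index": "library", "_type": "book", "_id": "1"},
            {"_index": "library", "_type": "book", "_id": "3"}
        ]
    })
    for doc in resp.json()["docs"]:
        print(doc["_id"], doc.get("found"))  # found is False for missing documents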

  • Routing algorithm

    Elasticsearch provides two routing algorithms:

    • Default route: shard = hash(routing) % number_of_primary_shards. Because the target shard is computed from the number of primary shards, that number is fixed once data has been written. During capacity expansion, the number of shards can only be multiplied (ES 6.x), so the expected future capacity must be specified when an index is created. ES 5.x does not support this kind of expansion; ES 7.x can be expanded freely (a sketch follows this list).
    • Custom route: In this routing mode, a routing value can be specified to determine the shard to which a document is written, and searches can be restricted to the specified routing.
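    A sketch of the default routing computation; Python's built-in hash() stands in for the Murmur3 hash that Elasticsearch actually applies to the routing value (the document ID unless a custom routing is specified):

    def route_to_shard(routing_value, number_of_primary_shards):
        """Default route: shard = hash(routing) % number_of_primary_shards."""
        return hash(routing_value) % number_of_primary_shards

    # The same routing value always maps to the same shard, which is why
    # number_of_primary_shards cannot change without redistributing all data.
    print(route_to_shard("user-42", 5))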
  • Balancing algorithm

    Elasticsearch provides an automatic balancing function for capacity expansion, capacity reduction, and data import scenarios. The algorithm is as follows (a numeric sketch follows the formulas):

    weight_index(node, index) = indexBalance * (node.numShards(index) - avgShardsPerNode(index))

    weight_node(node, index) = shardBalance * (node.numShards() - avgShardsPerNode)

    weight(node, index) = weight_index(node, index) + weight_node(node, index)
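    A numeric sketch of the weight function above. The indexBalance and shardBalance factors correspond to the cluster.routing.allocation.balance.index and cluster.routing.allocation.balance.shard settings; the values used here (0.55 and 0.45) are the standard Elasticsearch defaults and may be tuned per cluster.

    def weight(node_shards_of_index, avg_shards_of_index,
               node_shards_total, avg_shards_total,
               index_balance=0.55, shard_balance=0.45):
        """weight(node, index) = weight_index(node, index) + weight_node(node, index)."""
        weight_index = index_balance * (node_shards_of_index - avg_shards_of_index)
        weight_node = shard_balance * (node_shards_total - avg_shards_total)
        return weight_index + weight_node

    # A node holding more shards than the cluster average gets a positive weight,
    # so the balancer prefers moving shards off it onto lower-weight nodes.
    print(weight(4, 2.5, 10, 8.0))  # 0.55 * 1.5 + 0.45 * 2.0 = 1.725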

  • Single-node multi-instance deployment

    Multiple Elasticsearch instances can be deployed on the same node and are differentiated by IP address and port number. This method increases the utilization of the node's CPU, memory, and disk, and improves the indexing and searching capability of Elasticsearch.

  • Cross-node replica allocation policy

    When multiple instances are deployed on a single node and an index has replicas, copies of the same shard are only guaranteed to be allocated to different instances, not to different physical nodes, so a single point of failure may occur. Solve this problem by setting the cluster.routing.allocation.same_shard.host parameter to true (a sketch follows).
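    One way to apply the setting is through the cluster settings API, sketched below with a placeholder host; it can equally be configured in elasticsearch.yml.

    import requests

    # Forbid allocating two copies of the same shard to instances that share
    # a physical host, removing the single point of failure described above.
    requests.put("http://any-node:9200/_cluster/settings", json={
        "persistent": {"cluster.routing.allocation.same_shard.host": True}
    })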

Relationship with Other Components

Platform Location

Figure 2-10 shows the position of Elasticsearch in the application platform.

Figure 2-10 The position of Elasticsearch in the platform
Elasticsearch Indexing HBase Data

When Elasticsearch indexes HBase data, the HBase data is written to HDFS while Elasticsearch creates the corresponding index data. The index ID is mapped to the rowkey of the HBase data, which guarantees a unique mapping between each index record and each HBase row and enables full-text search of the HBase data (see the sketch below).
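A minimal sketch of the rowkey-to-ID mapping, with placeholder host, index, and type names: using the rowkey as the Elasticsearch document ID makes each index record map back to exactly one HBase row.

import requests

def index_hbase_row(rowkey, columns):
    """Index one HBase row, using its rowkey as the Elasticsearch document ID."""
    requests.put("http://es-client:9200/hbase_index/row/" + rowkey, json=columns)

index_hbase_row("user_0001", {"name": "Alice", "city": "Shenzhen"})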

Batch indexing: For data that already exists in HBase, a MapReduce (MR) task is submitted to read all the data in HBase, and the corresponding indexes are then created in Elasticsearch. Figure 2-11 shows the indexing process.

Figure 2-11 Elasticsearch indexing HBase data