FusionInsight HD V100R002C60SPC200 Product Description 06

Streaming

Basic Concept

Function

Streaming is a distributed, reliable, and fault-tolerant real-time computing system based on the open-source Apache Storm. It processes massive data streams in real time. Apache Storm is suited to real-time analysis, continuous computation, and distributed Extract, Transform, and Load (ETL). Storm has the following features:

  • Wide applications
  • Scalable
  • Free from data loss
  • Fault-tolerant
  • Easy to construct and control
  • Multi-language support

Streaming uses Storm as the computing platform and provides Continuous Query Language (CQL) in the service layer to simplify service implementation. Compared with Structured Query Language (SQL), CQL introduces the concept of windows. CQL is suited to service scenarios such as KPI statistics. CQL has the following features:

  • Easy to use: The CQL syntax is similar to the SQL syntax. Users who have basic knowledge of SQL can easily learn CQL and use CQL to develop services.
  • Rich functions: In addition to the basic expressions provided by SQL, CQL provides functions for stream processing, such as windows, filtering, and concurrency setting.
  • Easy to scale: CQL provides an extension interface to support increasingly complex service scenarios. Users can customize the input, output, serialization, and deserialization to meet specific service requirements.
  • Easy to debug: CQL provides detailed explanations of error codes, helping users rectify faults.
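The window concept that CQL adds to SQL can be illustrated without CQL syntax. The following is a minimal plain-Python sketch, not CQL and not the Streaming API: it groups events into fixed, non-overlapping time windows and sums a KPI per key in each window. The function name, the event layout, and the sample data are assumptions made for illustration.

```python
from collections import defaultdict


def tumbling_window_sums(events, window_seconds):
    """Group (timestamp, key, value) events into fixed, non-overlapping
    time windows and sum the values per key within each window."""
    sums = defaultdict(float)
    for timestamp, key, value in events:
        # Each event belongs to exactly one window, identified by its start time.
        window_start = (timestamp // window_seconds) * window_seconds
        sums[(window_start, key)] += value
    return dict(sums)


# Hypothetical KPI events: (seconds since start, cell ID, dropped calls)
events = [(1, "cell-A", 2), (4, "cell-A", 1), (7, "cell-B", 5), (11, "cell-A", 3)]
result = tumbling_window_sums(events, window_seconds=10)
```

With a 10-second window, the first three events fall into the window starting at 0 and the last event into the window starting at 10, so each window yields its own KPI totals.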
Architecture

The Streaming service consists of the active and standby Nimbus processes, UI processes, and multiple Supervisor processes, as shown in Figure 2-73.

Figure 2-73 Streaming architecture
Table 2-21 Streaming modules
Name Description
Nimbus Control nodes of the Streaming service. The Streaming service in HA mode has one active Nimbus and one standby Nimbus.
  • The active Nimbus receives tasks submitted by clients and dispatches tasks to Supervisors. It also monitors status of other processes.
  • The standby Nimbus takes over services from the active Nimbus if the active Nimbus is faulty.
Supervisor Monitors and receives tasks allocated by Nimbus, and starts or stops Workers as required. A Worker is a process that runs specific service logic. Each Worker is a JVM process.
UI Streaming service monitoring interface, through which the system running topology can be viewed.
ZooKeeper Provides distributed coordination services for processes of Streaming. Active and standby Nimbus, Supervisors, and Workers are registered with ZooKeeper so that Nimbus can obtain the health status of each process.
Principle
  • Basic Concepts
    Table 2-22 Concepts
    Concept Description
    Tuple A Tuple is an immutable key-value pair used to transfer data. Tuples are created and processed in a distributed manner.
    Stream A Stream is an unbounded sequence of Tuples.
    Topology A Topology is a real-time application running on the Storm platform. It is a Directed Acyclic Graph (DAG) composed of components. A Topology can run concurrently on multiple machines, with each machine running a part of the DAG. A Topology is similar to a MapReduce job, except that a Topology is a resident program: once started, it keeps running until it is manually terminated.
    Spout A Spout is a source of Tuples. For example, a Spout may read data from a message queue, database, file system, or TCP connection and emit them as Tuples, which are processed by the next component.
    Bolt A Bolt is a component that receives data and executes specific logic, such as filtering or converting tuples, joining or aggregating streams, and performing statistics and result persistence.
    Worker A Worker is a physical process running as part of a Topology. Each Worker is a JVM process. Each Topology may be executed by multiple Workers. Each Worker executes a logical subset of the Topology.
    Task A Task is an instance of a Spout or Bolt that runs in a thread of a Worker.
    Stream groupings A Stream grouping specifies the Tuple dispatching policy, that is, how Tuples are distributed among the Tasks of the subsequent Bolt. The supported policies include Shuffle Grouping, Fields Grouping, All Grouping, Global Grouping, None Grouping, and Direct Grouping.

    Figure 2-74 shows a Topology (DAG) consisting of a Spout and Bolts. In Figure 2-74, a rectangle indicates a Spout or Bolt, the nodes in each rectangle indicate Tasks, and the lines between Tasks indicate Streams.

    Figure 2-74 Topology
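To make the difference between two of the stream groupings concrete, here is a minimal plain-Python sketch. It is not Storm's implementation: the function names and the use of CRC32 as the hash are assumptions chosen only to show the idea that Fields Grouping routes equal field values to the same Task, while Shuffle Grouping spreads Tuples randomly.

```python
import random
import zlib


def fields_grouping(tuple_values, field_index, num_tasks):
    """Pick the target task from the value of one field, so that equal
    values always reach the same task."""
    key = str(tuple_values[field_index]).encode("utf-8")
    # CRC32 is used here only because it is stable across runs.
    return zlib.crc32(key) % num_tasks


def shuffle_grouping(num_tasks, rng):
    """Pick the target task at random to spread the load."""
    return rng.randrange(num_tasks)


words = [("storm",), ("kafka",), ("storm",), ("hbase",), ("storm",)]
fields_targets = [fields_grouping(w, 0, num_tasks=4) for w in words]

rng = random.Random(0)
shuffle_targets = [shuffle_grouping(4, rng) for _ in words]
```

Every `("storm",)` Tuple gets the same entry in `fields_targets`, which is what a word-counting Bolt needs to keep one consistent counter per word.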
  • Reliability

    Storm provides three levels of data reliability:
    • At most once: Data may be lost during processing, but it is never processed more than once. This reliability level offers the highest throughput.
    • At least once: Data may be processed more than once to ensure reliable data transmission. If a response is not received within the specified time, the Spout resends the data for processing. This reliability level may slightly affect system performance.
    • Exactly once: Data is transmitted successfully, without loss or duplicate processing. This reliability level offers the poorest performance.

    Select the reliability level based on service requirements. For example, use the exactly-once level for services that require high data reliability, so that each piece of data is processed exactly once. For services that tolerate data loss, use another level to improve system performance.
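The at-least-once behavior described above, in which the Spout resends data that was not acknowledged, can be sketched as follows. This is a toy model in plain Python, not the Storm API: the class `ReplayingSpout`, the `run` driver, and the `flaky` processing function are hypothetical, and a failed call stands in for a timeout.

```python
class ReplayingSpout:
    """Toy spout that keeps each message until it is acknowledged."""

    def __init__(self, messages):
        self.queue = list(enumerate(messages))  # (message ID, payload) waiting to be emitted
        self.pending = {}                       # emitted but not yet acknowledged

    def next_tuple(self):
        if not self.queue:
            return None
        msg_id, payload = self.queue.pop(0)
        self.pending[msg_id] = payload
        return msg_id, payload

    def ack(self, msg_id):
        self.pending.pop(msg_id, None)

    def fail(self, msg_id):
        # Called on a timeout or explicit failure: queue the message again.
        self.queue.append((msg_id, self.pending.pop(msg_id)))


def run(spout, process):
    """Drive the spout until every message has been acknowledged."""
    delivered = []
    while True:
        emitted = spout.next_tuple()
        if emitted is None:
            break
        msg_id, payload = emitted
        delivered.append(payload)
        if process(payload):
            spout.ack(msg_id)
        else:
            spout.fail(msg_id)
    return delivered


# Hypothetical processing logic that fails the first time it sees "b".
seen = set()


def flaky(payload):
    if payload == "b" and payload not in seen:
        seen.add(payload)
        return False
    return True


spout = ReplayingSpout(["a", "b", "c"])
delivered = run(spout, flaky)
```

Here `delivered` contains `"b"` twice: no message is lost, but downstream logic sees a duplicate. That duplicate is the cost of at-least-once delivery that the exactly-once level avoids.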

  • Fault Tolerance

    Storm is a fault-tolerant system that offers high availability. Table 2-23 describes the fault tolerance of the Storm components.

    Table 2-23 Fault tolerance
    Scenario Description
    Nimbus failed Nimbus is fail-fast and stateless. If the active Nimbus is faulty, the standby Nimbus takes over services immediately.
    Supervisor failed Supervisor is the background daemon that manages Workers. It is fail-fast and stateless. If a Supervisor is faulty, the Workers running on the node are not affected but cannot receive new tasks. The OMS can detect Supervisor faults and restart the processes.
    Worker failed If a Worker is faulty, the Supervisor will restart the Worker. If the Worker fails to start after multiple attempts, Nimbus will assign the task to another node.
    Node failed If a node is faulty, all the tasks being processed by the node time out and Nimbus will assign the tasks to another node for processing.
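The node failure scenario, in which Nimbus assigns the tasks of a failed node to another node, can be pictured with a small plain-Python sketch. The least-loaded placement policy, the function name `reassign`, and the sample assignment are assumptions for illustration only; the source does not describe how the real scheduler chooses the target node.

```python
def reassign(assignments, failed_node):
    """Move the tasks of a failed node to the least-loaded surviving nodes."""
    survivors = {node: list(tasks) for node, tasks in assignments.items()
                 if node != failed_node}
    if not survivors:
        raise RuntimeError("no surviving node can take over the tasks")
    for task in assignments.get(failed_node, []):
        # Assumed policy: fewest tasks first, node name as a tie-breaker.
        target = min(survivors, key=lambda node: (len(survivors[node]), node))
        survivors[target].append(task)
    return survivors


assignments = {"node1": ["t1", "t2"], "node2": ["t3"], "node3": ["t4", "t5"]}
after = reassign(assignments, "node3")
```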
Open-source Features
  • Distributed Real-time Computing Framework

    In a Storm cluster, each machine can run multiple worker processes, each worker process can create multiple threads, and each thread can execute multiple tasks. A task is the unit of concurrent data processing.

  • High Fault Tolerance

    During message processing, if a node or a process is faulty, the message processing unit can be redeployed.

  • Reliable Messages

    Data processing methods including At-Least Once, At-Most Once, and Exactly Once are supported.

  • Support for Multiple Languages

    In addition to Java, users can develop service logic in any language they are familiar with. Spouts and Bolts can exchange messages over standard input and standard output. The messages are JSON-encoded text, either single-line or multi-line.
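The following sketch shows what exchanging JSON messages over standard input and output might look like for a word-splitting Bolt written in Python. It assumes the framing used by open-source Storm's multi-language protocol, in which each JSON message is followed by a line containing only `end`. The message fields shown (`id`, `tuple`, `command`, `anchors`) are a simplified assumption, and in-memory streams stand in for `sys.stdin` and `sys.stdout` so that the example is self-contained.

```python
import io
import json


def read_messages(stream):
    """Yield JSON messages from a text stream in which each message
    (one or more lines) is followed by a line containing only 'end'."""
    lines = []
    for line in stream:
        line = line.rstrip("\n")
        if line == "end":
            yield json.loads("\n".join(lines))
            lines = []
        else:
            lines.append(line)


def write_message(stream, message):
    """Write one JSON message followed by the 'end' delimiter line."""
    stream.write(json.dumps(message) + "\nend\n")


# One single-line and one multi-line message, as they might arrive on stdin.
incoming = io.StringIO('{"id": "1", "tuple": ["hello world"]}\nend\n'
                       '{\n  "id": "2",\n  "tuple": ["storm"]\n}\nend\n')
outgoing = io.StringIO()

for msg in read_messages(incoming):
    for word in msg["tuple"][0].split():
        write_message(outgoing, {"command": "emit", "anchors": [msg["id"]], "tuple": [word]})
    write_message(outgoing, {"command": "ack", "id": msg["id"]})
```

Reading until the delimiter line, rather than parsing line by line, is what allows a message to span several lines.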

  • Security Mechanism

    Streaming provides Kerberos-based authentication and pluggable authorization mechanisms, supports SSL for the Storm UI and Log Viewer UI, and supports security integration with other big data platform components (such as ZooKeeper and HDFS).

  • Rolling Upgrade

    Streaming supports seamless upgrade to the latest version. That is, services do not need to be redeployed during a software upgrade.

  • Flexible Topology Definition and Deployment

    The Flux framework is used to define and deploy service topologies. If the service DAG changes, users only need to modify the YAML domain-specific language (DSL) definition and do not need to recompile or repackage the service code.

  • Integration with External Components

    Streaming supports integration with multiple external components such as Kafka, HDFS, HBase, Redis, and JDBC/RDBMS, implementing services that involve multiple data sources.

Relationship with Other Components

Storm provides a real-time distributed computing framework. It obtains real-time messages from data sources (such as Kafka and TCP connections), performs high-throughput, low-latency computing on them, and exports the results to message queues or persists them. Figure 2-75 shows the relationship between Storm and other components.
Figure 2-75 Relationship with other components
Updated: 2019-04-10

Document ID: EDOC1000104139
