FusionInsight HD 6.5.0 Software Installation 02

Configuring S3 Interconnection

Introduction to S3

Simple Storage Service (S3) is an Internet-oriented simple storage service provided by Amazon. It provides a simple web service interface and can be used as a third-party storage system to store and back up HDFS data. Among the big data components, Hadoop, Hive, and Spark can access S3 through the S3A file system. For FusionInsight, S3 is a third-party storage service used in addition to HDFS; it cannot replace HDFS.

Figure 5-3 Architecture of the connection between FusionInsight and FusionStorage

S3 Application Scenario

  • When the cluster data scale is large, data in the HDFS can be backed up to S3 or existing data on S3 can be restored to HDFS.
  • Use Hive to create tables on S3, analyze existing S3 data, or use Hive to analyze data in HDFS and write data into S3 tables.
  • Spark applications can directly read and write data on S3. Spark-sql analyzes the existing data on S3 or analyzes the data in the HDFS and writes the data into the S3 tables.
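
The following is a minimal, hypothetical Scala sketch of the Spark scenario described above; the bucket name and input/output paths are placeholders, not values from this guide.

    // Minimal sketch: read existing data on S3, run a simple analysis step, and write the result back to S3.
    // "bucketname" and the paths below are placeholders.
    import org.apache.spark.{SparkConf, SparkContext}

    object S3ReadWriteSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("S3ReadWriteSketch"))
        val input = sc.textFile("s3a://bucketname/path/input/*")   // read existing S3 data
        val errors = input.filter(_.contains("ERROR"))              // simple analysis step
        errors.saveAsTextFile("s3a://bucketname/path/output")       // write results back to S3
        sc.stop()
      }
    }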

S3 Configuration Methods

You can configure S3 using either of the following methods:

  • Configure S3 authentication information on FusionInsight Manager and use the downloaded component clients. For details, see Configuring S3 Authentication Information on FusionInsight Manager and Configuring a Component Client to Access S3.
  • Configure S3 authentication information directly on a component client. For details, see Configuring S3 Authentication Information on a Component Client.

S3 Benefits

  • Cold data in the HDFS is backed up to S3, saving storage costs.
  • Data stored on S3 can be shared by multiple clusters.
  • Data on S3 will not be deleted when the computing cluster is destroyed.

S3 Shortcomings

  • S3 is not strongly consistent; it guarantees only eventual consistency of data.
  • Directory operations that are atomic in HDFS are non-ACID on S3, and their duration is directly proportional to the number of file objects involved. Such operations include renaming and reloading. If an operation is interrupted, residual files may remain.
  • S3 does not support HDFS file and directory permission operations or the ACL mechanism.
  • S3 cannot be used to perform operations on HBase files.

Configuring S3 Authentication Information on FusionInsight Manager

Configuring S3 Authentication Information on FusionInsight Manager
Scenarios

This section describes how to configure S3 authentication information for HDFS, Hive, and Spark/Spark2x on FusionInsight Manager. After the configuration, download the client and run the required Hadoop commands on it to access S3 data, for example, to view the S3 data list, upload local files to S3, download S3 data to the client, and create folders on S3.

For Spark/Spark2x, S3 can be accessed from the Spark client either through spark-sql or through Spark applications.

Prerequisites

You have logged in to FusionInsight Manager.

Procedure
  1. Choose Service > HDFS > Configuration.
  2. Choose All Configurations > HDFS.
  3. Choose NameNode > S3service.
  4. Configure interconnection parameters by referring to Table 5-8.

    Table 5-8 Interconnection parameters

    fs.s3a.access.key
        Description: Key ID used for accessing the S3A file system.
        Default value: <empty>

    fs.s3a.secret.key
        Description: Key used for accessing the S3A file system.
        Default value: <empty>

    fs.s3a.endpoint
        Description: Endpoint (connection address) of the S3 service.
        Default value: s3.amazonaws.com

    fs.s3a.signing-algorithm
        Description: Signature algorithm used for accessing S3. By default, the earlier S3SignerType signature algorithm is used. If this parameter is left empty, the default signature algorithm of the SDK is used, which is incompatible with Huawei OBS.
        Default value: S3SignerType

    fs.s3a.connection.ssl.enabled
        Description: Whether to use SSL to connect to the S3 service. If this parameter is set to true, the S3 server must provide a trusted certificate issued by the root certificate authority.
        Default value: false

    hadoop.security.credstore.java-keystore-provider.password
        Description: User-defined password for all keystores. This parameter is mandatory when fs.s3a.access.key and fs.s3a.secret.key are configured.
        Default value: <empty>

  5. Enter the required S3 information. For the preceding parameters whose default value is empty, you must enter a value for each of them.
  6. Click Save. On the displayed dialog box, click OK to restart the Hive and Spark services.

    NOTE:

    Hive and Spark depend on the S3 configuration in HDFS. After the S3 configuration in HDFS is modified, the Hive and Spark configurations expire, and the Hive and Spark services must be restarted for the new configuration to take effect.

    To use CarbonData to connect to S3, use Spark2x and modify Spark2x configurations. CarbonData in Spark is of a low version and does not support interconnection with S3.

Configuring a Component Client to Access S3

Configuring the HDFS Client
Scenarios

This section instructs software installation engineers to configure S3 access for the HDFS client after S3 is authenticated on FusionInsight Manager.

Operation Restrictions

S3 does not support the following Hadoop commands:

  • hadoop fs -getfacl
  • hadoop fs -setfacl
  • hadoop fs -getfattr
  • hadoop fs -setfattr

When distcp is used to back up data to S3, the following options cannot be used:

  • append
  • diff
  • atomic
  • Do not use the a or x flags together with the -p option.
  • Do not use -skipcrccheck; the CRC check is not performed regardless of whether it is set.
Procedure
  1. Download and install the client by referring to Installing a Client.
  2. Use PuTTY to log in to the node on which the client is installed as the client installation user.
  3. Run the following command to go to the client installation directory:

    cd /opt/hadoopclient

  4. Run the following command to configure environment variables:

    source bigdata_env

  5. If the cluster is in security mode, run the following command to perform user authentication. In normal mode, user authentication is not required.

    kinit Component service user

  6. Before using the shell command to access S3 for the first time, run the following command. If the command has been executed, skip this operation.

    echo "hadoop_add_to_classpath_tools hadoop-aws" > ~/.hadooprc

  7. Run Hadoop shell commands to access S3. bucketname indicates the S3 bucket name and can be customized. Examples:

    hadoop fs -ls s3a://bucketname/path

    hadoop fs -cat s3a://bucketname/filename

    hadoop fs -mkdir s3a://bucketname/newpath

    hadoop fs -put localfilename s3a://bucketname/path

  8. Back up the data in the HDFS to S3. For example, back up the HDFS data to the following directory:

    hadoop distcp /tmp/bigfile s3a://bucketname/path/bigfile

  9. Restore data on S3 to HDFS. For example, restore the S3 data to the following directory on HDFS:

    hadoop distcp s3a://bucketname/path/bigfile /tmp/bigfile

Configuring the Hive Client
Scenarios

This section instructs software installation engineers to configure S3 access for the Hive client after S3 is authenticated on FusionInsight Manager.

Procedure

Accessing S3 from the Hive client

  1. Download and install the client by referring to Installing a Client.
  2. Use PuTTY to log in to the node on which the client is installed as the client installation user.
  3. Run the following command to go to the client installation directory:

    cd /opt/hadoopclient

  4. Run the following command to configure environment variables:

    source bigdata_env

  5. When you use the client to connect to a specific Hive instance in a scenario where multiple Hive instances are installed, run the following command to load the environment variables of the instance. Otherwise, skip this operation. For example, load the environment variables of the Hive2 instance.

    source Hive2/component_env

  6. Log in to the Hive client based on the cluster authentication mode.

    • In the security mode, run the following command to complete user authentication and log in to the Hive client:

      kinit Component service user

      beeline

    • In the normal mode, run the following command to log in to the Hive client. If no component service user is specified, the current OS user is used to log in to the Hive client.

      beeline -n Component service user

  7. Create a Hive table on S3 by setting the LOCATION parameter. Example:

    CREATE EXTERNAL TABLE `student`( 
    `id` int, 
    `name` string, 
    `class` string) 
    PARTITIONED BY (`region` string) STORED AS TEXTFILE 
    LOCATION 's3a://bucketname/path/student';
    NOTE:

    When the table is created, data in the partitioned table is not automatically loaded. If partition-related data already exists in s3a://bucketname/path/student, run the MSCK REPAIR TABLE student command to add the partition information. If the directory does not contain any partition data, you need to load data from another S3 directory. Example:

    LOAD DATA INPATH 's3a://bucketname/path/xian' OVERWRITE INTO 
    TABLE `student` PARTITION(region='xian');

Configuring the Spark Client

Configuring Spark-sql
Scenarios

This section instructs software installation engineers to configure S3 access for the Spark client after S3 is authenticated on FusionInsight Manager.

Spark supports two S3 access modes: spark-sql and Spark applications. This topic describes how to configure spark-sql to access S3 and use SQL statements to operate on S3 data. For details about how to configure the Spark application mode, see "Configuring Spark Client".

Prerequisites
  • You have logged in to FusionInsight Manager.
  • S3 authentication information for HDFS has been configured on FusionInsight Manager.
Procedure

Parameter configuration

NOTE:

This section uses Spark2x as an example. The procedure for configuring Spark client is the same as that for Spark2x client. CarbonData in Spark is of a low version and does not support interconnection with S3. Use CarbonData of Spark2x.

  1. Choose Service > HDFS > Configuration.
  2. Choose All Configurations > HDFS.
  3. Choose NameNode > S3service.
  4. For parameter descriptions, see Table 5-8. Spark inherits the S3 configuration from HDFS. After modifying the configuration, restart the Spark2x service.

Accessing S3 from the Spark2x client

  1. Download and install the client by referring to Installing a Client.
  2. Use PuTTY to log in to the node on which the client is installed as the client installation user.
  3. Run the following command to go to the client installation directory:

    cd /opt/hadoopclient

  4. Run the following command to configure environment variables:

    source bigdata_env

  5. When you use the client to connect to a specific Spark2x instance in a scenario where multiple Spark2x instances are installed, run the following command to load the environment variables of the instance. Otherwise, skip this operation. For example, load the environment variables of the Spark2x2 instance.

    source Spark2x2/component_env

  6. If the cluster is in security mode, run the following command to perform user authentication. In normal mode, user authentication is not required.

    kinit Component service user

  7. Run the following command to start spark-sql:

    spark-sql

  8. Create a table pointing to S3. For example:

    CREATE EXTERNAL TABLE `employee`( 
    `eid` int, 
    `name` string, 
    `salary` string) 
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ' 
    STORED AS TEXTFILE 
    LOCATION 's3a://bucketname/path/employee';

  9. Analyze S3 data. The following is an example:

    SELECT COUNT(*) FROM employee;

Configuring Spark Client
Scenarios

This section instructs software installation engineers to configure S3 access for the Spark client after S3 is authenticated on FusionInsight Manager.

Spark supports the spark-sql and Spark application modes. This section describes how to configure the Spark application mode for S3 access. For details about how to configure spark-sql, see Configuring Spark-sql.

Procedure
NOTE:

This section uses Spark2x as an example. The procedure for configuring Spark client is the same as that for Spark2x client. CarbonData in Spark is of a low version and does not support interconnection with S3. Use CarbonData of Spark2x.

  1. Download and install the client by referring to Installing a Client.
  2. Use PuTTY to log in to the node on which the client is installed as the client installation user.
  3. Run the following command to go to the client installation directory:

    cd /opt/hadoopclient

  4. Run the following command to configure environment variables:

    source bigdata_env

  5. When you use the client to connect to a specific Spark2x instance in a scenario where multiple Spark2x instances are installed, run the following command to load the environment variables of the instance. Otherwise, skip this operation. For example, load the environment variables of the Spark2x2 instance.

    source Spark2x2/component_env

  6. If the cluster is in security mode, run the following command to perform user authentication. In normal mode, user authentication is not required.

    kinit Component service user

  7. Access the data on S3 from the application. Example:

    val myRDD = sc.textFile("s3a://bucketname/path/input/*")

  8. Use spark-submit to submit the Spark application. Example:

    spark-submit --class org.apache.spark.examples.Test --master yarn --deploy-mode cluster --name sparkTest /opt/test-example.jar
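
    For reference, the following is a minimal, hypothetical Scala sketch of such an application. The class name matches the illustrative org.apache.spark.examples.Test used in the spark-submit example above; the bucket name and path are placeholders.

    // Hypothetical sketch of the application class used in the spark-submit example above.
    package org.apache.spark.examples

    import org.apache.spark.{SparkConf, SparkContext}

    object Test {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("sparkTest"))
        // Read data on S3 (bucketname and path are placeholders).
        val myRDD = sc.textFile("s3a://bucketname/path/input/*")
        // Trivial analysis step: count the lines and print the result to the driver log.
        println(s"Line count: ${myRDD.count()}")
        sc.stop()
      }
    }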

Configuring S3 Authentication Information on a Component Client

Configuring S3 Authentication Information on the HDFS Client
Scenarios

This section instructs software installation engineers to configure S3 authentication information on the HDFS client and to access S3 data on the client by using Hadoop commands. For example, software installation engineers can view the data list on S3, upload local files to S3, download S3 data to the client, and create folders on S3.

Prerequisites
  • The client has been installed. For details, see Installing a Client. For example, the client is installed in the /opt/client directory.
  • Component service users have been created by the administrator as required. For details, see Creating Users. In security mode, human-machine users need to change their passwords upon the first login.
Procedure
  1. Use PuTTY to log in to the node on which the client is installed as the client installation user.
  2. Run the following command to go to the client installation directory:

    cd /opt/client

  3. Run the following command to configure environment variables:

    source bigdata_env

  4. Modify the core-site.xml file under /opt/client/HDFS/hadoop/etc/hadoop/ and the one under /opt/client/Yarn/hadoop/etc/hadoop/ to add the following configuration (replace the values of access-key, secret-key, and obs-address with your own access key, secret key, and service address). If the files have already been modified, skip this operation.

    <property> 
    <name>fs.s3a.access.key</name> 
    <value>access-key</value> 
    </property> 
    <property> 
    <name>fs.s3a.secret.key</name> 
    <value>secret-key</value> 
    </property> 
    <property> 
    <name>fs.s3a.endpoint</name> 
    <value>obs-address</value> 
    </property>

  5. Before using the shell command to access S3 for the first time, run the following command. If the command has been executed, skip this operation.

    echo "hadoop_add_to_classpath_tools hadoop-aws" > ~/.hadooprc

  6. Run Hadoop shell commands to access S3. bucketname indicates the S3 bucket name and can be customized. Examples:

    hadoop fs -ls s3a://bucketname/path

    hadoop fs -cat s3a://bucketname/filename

    hadoop fs -mkdir s3a://bucketname/newpath

    hadoop fs -put localfilename s3a://bucketname/path

  7. Back up the data in the HDFS to S3. For example, back up the HDFS data to the following directory:

    hadoop distcp /tmp/bigfile s3a://bucketname/path/bigfile

  8. Restore data on S3 to HDFS. For example, restore the S3 data to the following directory on HDFS:

    hadoop distcp s3a://bucketname/path/bigfile /tmp/bigfile

NOTE:
  • S3 does not support the following Hadoop commands:

    hadoop fs -getfacl

    hadoop fs -setfacl

    hadoop fs -getfattr

    hadoop fs -setfattr

  • When distcp is used to back up data to S3, the following options cannot be used:
    • append
    • diff
    • atomic
    • Do not use the a or x flags together with the -p option.
    • Do not use -skipcrccheck; the CRC check is not performed regardless of whether it is set.
Configuring S3 Authentication Information on the Hive Client
Scenarios

Data sets stored on S3 can easily be used as external or internal tables in Hive. The difference between the two table types is that data in an external table is not deleted when the table is dropped. Typical tasks include analyzing data in Hive tables in HDFS and inserting the results into Hive tables on S3, analyzing the existing data on S3, and analyzing Hive table data on S3 and inserting the results into Hive tables in HDFS.

The Hive client supports modifying only the access key and secret key interconnection parameters. For details about how to modify other parameters, see "Configuring S3 Authentication Information on FusionInsight Manager".

Prerequisites
  • The client has been installed. For details, see Installing a Client. For example, the client is installed in the /opt/client directory.
  • Component service users have been created by the administrator as required. For details, see Creating Users. In security mode, human-machine users need to change their passwords upon the first login.
  • S3 service address information for HDFS has been configured on FusionInsight Manager.
Procedure
  1. Use PuTTY to log in to the node on which the client is installed as the client installation user.
  2. Run the following command to go to the client installation directory:

    cd /opt/client

  3. Run the following command to configure environment variables:

    source bigdata_env

  4. When you use the client to connect to a specific Hive instance in a scenario where multiple Hive instances are installed, run the following command to load the environment variables of the instance. Otherwise, skip this operation. For example, load the environment variables of the Hive2 instance.

    source Hive2/component_env

  5. Log in to the Hive client based on the cluster authentication mode.

    • In the security mode, run the following command to complete user authentication and log in to the Hive client:

      kinit Component service user

      beeline

    • In the normal mode, run the following command to log in to the Hive client. If no component service user is specified, the current OS user is used to log in to the Hive client.

      beeline -n Component service user

  6. Each time you start beeline, manually set the following parameters before creating or accessing tables on S3:

    set fs.s3a.access.key=ak; 
    set fs.s3a.secret.key=sk; 
    set metaconf:fs.s3a.access.key=ak; 
    set metaconf:fs.s3a.secret.key=sk;

  7. Create a Hive table on S3 by setting the LOCATION parameter.

    NOTE:

    If the connection fails, configure other Hive interconnection parameters by referring to "Configuring S3 Authentication Information on FusionInsight Manager".

    Example:

    CREATE EXTERNAL TABLE `student`( 
    `id` int, 
    `name` string, 
    `class` string) 
    PARTITIONED BY (`region` string) STORED AS TEXTFILE 
    LOCATION 's3a://bucketname/path/student';
    NOTE:

    When the table is created, data in the partitioned table is not automatically loaded. If partition-related data already exists in s3a://bucketname/path/student, run the MSCK REPAIR TABLE student command to add the partition information. If the directory does not contain any partition data, you need to load data from another S3 directory. Example:

    LOAD DATA INPATH 's3a://bucketname/path/xian' OVERWRITE INTO 
    TABLE `student` PARTITION(region='xian');

Configuring S3 Authentication Information on the Spark Client
Scenarios

This section instructs software installation engineers to configure S3 authentication information on the Spark client and read and write data on S3 by compiling the Spark program for composite service requirements.

NOTE:

To use CarbonData to connect to S3, use Spark2x and modify Spark2x configurations. CarbonData in Spark is of a low version and does not support interconnection with S3.

Prerequisites
  • The client has been installed. For details, see Installing a Client. For example, the client is installed in the /opt/client directory.
  • Component service users have been created by the administrator as required. For details, see Creating Users. In security mode, human-machine users need to change their passwords upon the first login.
Procedure
  1. Use PuTTY to log in to the node on which the client is installed as the client installation user.
  2. Run the following command to go to the client installation directory:

    cd /opt/hadoopclient

  3. Run the following command to configure environment variables:

    source bigdata_env

  4. When you use the client to connect to a specific Spark2x instance in a scenario where multiple Spark2x instances are installed, run the following command to load the environment variables of the instance. Otherwise, skip this operation. For example, load the environment variables of the Spark2x2 instance.

    source Spark2x2/component_env

  5. If the cluster is in security mode, run the following command to perform user authentication. In normal mode, user authentication is not required.

    kinit Component service user

  6. Configure the S3 authentication information in the application code. In the following example, sc indicates the SparkContext object, and its Hadoop configuration is used to set the S3 authentication information.

    sc.hadoopConfiguration.set("fs.s3a.access.key", "access_key") // Replace access_key with the actual access key.
    sc.hadoopConfiguration.set("fs.s3a.secret.key", "secret_key") // Replace secret_key with the actual secret key.
    sc.hadoopConfiguration.set("fs.s3a.endpoint", "obs-address") // Replace obs-address with the actual S3 service address.

  7. Modify the configuration in the Spark client configuration file core-site.xml. For example, if the Spark2x instance is used, modify the configuration in the core-site.xml file under /opt/client/Spark2x/spark/conf/ to add the following configurations (replacing the access key, secret key, and service address). If the configuration file has been modified, skip this operation.

    <property> 
    <name>fs.s3a.access.key</name> 
    <value>access-key</value> 
    </property> 
    <property> 
    <name>fs.s3a.secret.key</name> 
    <value>secret-key</value> 
    </property> 
    <property> 
    <name>fs.s3a.endpoint</name> 
    <value>obs-address</value> 
    </property>

  8. Use spark-submit to submit the Spark application. Example:

    spark-submit --class org.apache.spark.examples.Test --master yarn --deploy-mode cluster --name sparkTest /opt/test-example.jar
