Auto Tuning in Inference Scenario
Environment Preparations
- The Auto Tune tool runs on the Ascend AI Processor. Currently, this tool supports only the development+commissioning scenario, that is, the Ascend Toolkit is installed on a device powered by the Ascend AI Processor.
- Deploy the environment by referring to CANN Software Installation Guide.
- Ensure that the available disk space in the home directory of the user who performs tuning in the operating environment is at least 20 GB.
- The Auto Tune tool is stored in the python/site-packages/schedule_search.egg/schedule_search/ and python/site-packages/auto_tune.egg/auto_tune/ directories of the ATC installation path.
- Third-party software
After the environment is deployed, install the third-party software that the Auto Tune tool depends on. For details, see Table 6-1.
Table 6-1 Dependent software of Auto Tune

| Third-party Software | Description | How to Install |
| --- | --- | --- |
| TensorFlow 1.15 | For guiding operator search in RL tuning. | - |
| pciutils | For querying the hardware device of the Ascend AI Processor by running the lspci command in RL tuning. | For CentOS, run the yum install pciutils command to install it. |
- If the --install-for-all option was included in the ATC installation command (that is, all users have the permission to run the Auto Tune tool), ensure that other users have read, write, and execute permissions on the custom repository in either of the following ways (a combined sketch follows this list):
- Use the default custom repository directory, that is, do not set TUNE_BANK_PATH. In this case, run the following command to modify the permissions on the default custom repository:
chmod -R 777 ${install_path}/atc/data
- Use a specified custom repository directory by setting TUNE_BANK_PATH. For details, see Environment Variable Configuration.
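A minimal sketch of the two options (the custom repository directory is an assumption; ${install_path} refers to the ATC installation path):
# Option 1: keep the default custom repository (TUNE_BANK_PATH not set) and open its permissions to all users
chmod -R 777 ${install_path}/atc/data
# Option 2: point TUNE_BANK_PATH at a directory that all tuning users can read, write, and execute
export TUNE_BANK_PATH=/home/HwHiAiUser/custom_tune_bank
chmod -R 777 /home/HwHiAiUser/custom_tune_bank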
Environment Variable Configuration
Before starting the Auto Tune tool, run the export command to declare environment variables on the terminal. The declared environment variables become invalid when the Shell terminal is closed.
export install_path=/home/HwHiAiUser/Ascend/ascend-toolkit/latest
export LD_LIBRARY_PATH=${install_path}/acllib/lib64:${install_path}/atc/lib64:$LD_LIBRARY_PATH
export PATH=${install_path}/atc/ccec_compiler/bin:${install_path}/atc/bin:$PATH
export PYTHONPATH=${install_path}/atc/python/site-packages:${install_path}/atc/python/site-packages/auto_tune.egg/auto_tune:${install_path}/atc/python/site-packages/schedule_search.egg:$PYTHONPATH
export ASCEND_OPP_PATH=${install_path}/opp
# Optional environment variables of Auto Tune
export TUNE_BANK_PATH=/home/HwHiAiUser/custom_tune_bank
export REPEAT_TUNE=False
export TUNE_OPS_NAME=conv_layers/Pad_1 # Nodes to be tuned. If specified, only the specified nodes are tuned.
export ASCEND_DEVICE_ID=0
export TE_PARALLEL_COMPILER=2
export ENABLE_TUNE_BANK=True
# Environment variable for offline tuning
export ENABLE_TUNE_DUMP=True
# Optional environment variable for offline tuning
export TUNE_DUMP_PATH=/home/HwHiAiUser/DumpData
- install_path indicates the ATC, ACLlib, and OPP installation path. Replace it with the actual path.
- You can save the commands for setting environment variables to a custom script for future use, as sketched below.
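As a minimal sketch (the script path is an assumption), the export commands above can be collected into a script and sourced before each tuning session:
# Hypothetical helper script containing the export commands shown above
source /home/HwHiAiUser/set_auto_tune_env.sh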
The environment variables are described below.
- LD_LIBRARY_PATH: (Required) Dynamic library path. Set this variable according to the preceding example.
- PATH: (Required) Executable program path. Set this variable according to the preceding example.
- PYTHONPATH: (Required) Python path. Set this variable according to the preceding example.
- ASCEND_OPP_PATH: (Required) OPP root directory, an environment variable required for operator build. Set this variable according to the preceding example.
- ASCEND_DEVICE_ID: (Optional) Logical ID of the Ascend AI Processor. The value range is [0, N – 1] and the default value is 0, where N indicates the device count in the physical machine, VM, or container. For the relationship between DEVICE_ID and ASCEND_DEVICE_ID, see the NOTICE below.
NOTICE:
DEVICE_ID will be deprecated in later versions. You are advised to use ASCEND_DEVICE_ID for a new installation.
- DEVICE_ID: (Optional) An integer indicating the ID of the Ascend AI Processor device on which Auto Tune is performed. To query the IDs of the available devices, run the ls -l /dev | grep davinci command on the host as the root user. In the following output, the digit at the end of each davinci entry is a device ID:
[root@localhost home]# ls -l /dev | grep davinci
crw-rw----. 1 HwHiAiUser HwHiAiUser 241, 0 Jul 23 2020 davinci0
crw-rw----. 1 HwHiAiUser HwHiAiUser 241, 0 Jul 23 2020 davinci1
crw-rw----. 1 HwHiAiUser HwHiAiUser 241, 0 Jul 23 2020 davinci2
crw-rw----. 1 HwHiAiUser HwHiAiUser 241, 0 Jul 23 2020 davinci3
...
crw-rw----. 1 HwHiAiUser HwHiAiUser 240, 0 Jul 23 2020 davinci_manager
- TE_PARALLEL_COMPILER: (Optional) Parallel build enable. Parallel build is especially useful when building a deep network. TE_PARALLEL_COMPILER indicates the number of parallel operator build processes. The value must be an integer in the range [1, 32] and defaults to 8. When the value is greater than 1, parallel build is enabled. The maximum recommended value is calculated as follows: Number of CPU cores x 80%/Number of Ascend AI Processors.
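As a worked example under assumed hardware (32 host CPU cores shared by 2 Ascend AI Processors), the ceiling is 32 x 80%/2 = 12.8, so a value of at most 12 stays within the limit:
export TE_PARALLEL_COMPILER=12 # within [1, 32] and below the calculated ceiling for this assumed host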
- TUNE_BANK_PATH: (Optional) Path of the custom repository generated after Auto Tune, for both Operators Supporting RL Tuning and Operators Supporting GA Tuning.
NOTE:
If the repository path is customized before tuning, you also need to configure this environment variable if you want to use the custom repository during model conversion. For example, if the custom repository is stored in the /home/HwHiAiUser/custom_bank/<soc version>/ga directory, configure the following environment variable when converting the model using ATC:
export TUNE_BANK_PATH=/home/HwHiAiUser/custom_bank
- REPEAT_TUNE: (Optional) Repeat tuning enable. Takes effect only when Auto Tune is enabled. If it is set to False and a tuning case for the network is available in the repository (built-in or custom), the tuning process for that case is skipped. If the logic of an operator changes (for example, ND input support is added to the GEMM operator), set this environment variable to True and initiate tuning again. Value range: either True or False. Defaults to False.
- ENABLE_TUNE_BANK: (Optional) Repository enable during operator build. Defaults to True.
- TUNE_OPS_NAME: (Optional) Specified-layer tuning, used in network tuning scenarios. After analyzing the profiling performance of a network, you can use this environment variable to specify a low-performance operator for tuning.
NOTE:
This environment variable applies only to tuning performed along with network model generation; it does not support offline tuning based on dump data.
- ENABLE_TUNE_DUMP: (Optional) Operator dump enable, applicable to offline tuning based on dump data. Value range: either True or False. Defaults to False.
NOTE:
If this environment variable is set to True, online tuning is not performed even if Auto Tune is enabled (only dump data is generated).
- TUNE_DUMP_PATH: (Optional) Dump path, applicable to offline tuning based on dump data. Set this environment variable to an absolute path or a path relative to the location where Auto Tune is executed. Specify a path that is readable, writable, and executable for any involved user. If this environment variable is not configured, a tune_dump directory is generated in the tool execution path by default.
Tuning Procedure
Prerequisites
- Prepare the development environment and operating environment by referring to Environment Preparations and install the required software.
- Configure the environment variables on which the Auto Tune tool depends by referring to Environment Variable Configuration.
Tuning During Model Conversion with ATC
When using ATC to convert a model, you can enable Auto Tune by setting the --auto_tune_mode option to one of the following values:
- "RL,GA": Both RL and GA are used for tuning. The sequence of RL and GA is not sensitive. The Auto Tune tool automatically selects the RL mode or GA mode according to the operator characteristics.
- "RL": Only Operators Supporting RL Tuning are tuned.
- "GA": Only Operators Supporting GA Tuning are tuned.
Example command:
atc --model=./tune.pb --framework=3 --output=./add_tune --output_type=FP16 --soc_version=Ascend310 --auto_tune_mode="RL,GA"
- By default, no log is generated during ATC model conversion. To generate Auto Tune logs (INFO level only), include the --log=info option in the ATC command. The Auto Tune log is written to the /var/log/npu/slog/host-0/host-0_*.log file. For details about the ATC command-line options, see "Restrictions and Parameters" in ATC Tool Instructions.
- You can set the following tuning functions through environment variables (a combined sketch follows this list):
- If an operator in the network model has a match in the repository, the operator will not be tuned repeatedly by default. You can configure the REPEAT_TUNE environment variable to forcibly tune the operator again.
- You can configure the TUNE_OPS_NAME environment variable to tune a specified operator layer.
- The Auto Tune tool also provides other environment variable functions. For details, see Environment Variable Configuration.
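A combined sketch of these switches, reusing the example model and node name from this section (both are illustrative):
# Force re-tuning even if matching cases already exist in the repository
export REPEAT_TUNE=True
# Restrict tuning to one low-performance node identified through profiling (name is illustrative)
export TUNE_OPS_NAME=conv_layers/Pad_1
atc --model=./tune.pb --framework=3 --output=./add_tune --output_type=FP16 --soc_version=Ascend310 --auto_tune_mode="RL,GA"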
- It is allowed to start multiple ATC processes for tuning on the host. Proper process parallelism improves tuning efficiency. However, due to resource restrictions, tuning efficiency decreases once the number of parallel processes exceeds a certain limit. The following condition should be met:
ATC process count x TE_PARALLEL_COMPILER x 2 < Host CPU core count
TE_PARALLEL_COMPILER indicates the number of parallel operator build processes.
In the TBE operator parallel build scenario (that is, TE_PARALLEL_COMPILER > 1), it is recommended that each ATC process correspond to one device, as in the sketch below.
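A minimal sketch of running two ATC tuning processes in parallel, one per device; the model names and device IDs are assumptions:
# Background job 1: tune the first model on device 0
ASCEND_DEVICE_ID=0 atc --model=./net0.pb --framework=3 --output=./net0_tune --soc_version=Ascend310 --auto_tune_mode="RL,GA" &
# Background job 2: tune the second model on device 1
ASCEND_DEVICE_ID=1 atc --model=./net1.pb --framework=3 --output=./net1_tune --soc_version=Ascend310 --auto_tune_mode="RL,GA" &
wait # keep ATC process count x TE_PARALLEL_COMPILER x 2 below the host CPU core count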
Tuning During IR Model Building
When using the Ascend Graph API to build an offline model, you can set the global_options parameter of the model building initialization API aclgrphBuildInitialize as follows to enable the Auto Tune tool:
std::map<std::string, std::string> global_options = {
    {ge::ir_option::SOC_VERSION, "Ascend310"},
    {ge::ir_option::EXEC_DISABLE_REUSED_MEMORY, "0"},
    {ge::ir_option::AUTO_TUNE_MODE, "RL,GA"}
};
auto status = aclgrphBuildInitialize(global_options);
- "RL,GA": Both RL and GA are used for tuning. The sequence of RL and GA is not sensitive. The Auto Tune tool automatically selects the RL mode or GA mode according to the operator characteristics.
- "RL": Only Operators Supporting RL Tuning are tuned.
- "GA": Only Operators Supporting GA Tuning are tuned.
By default, no log is generated during IR model building. To generate Auto Tune logs (INFO level only), add the following configuration to the options parameter of the model build API aclgrphBuildModel:
{ge::ir_option::LOG_LEVEL, "info"}
The Auto Tune log is written to the /var/log/npu/slog/host-0/host-0_*.log file. For details about how to build an IR model, see IR Model Building Guide.
You can set the following tuning functions through environment variables:
- If an operator in the network model has a match in the repository, the operator will not be tuned repeatedly by default. You can configure the REPEAT_TUNE environment variable to forcibly tune the operator again.
- You can configure the TUNE_OPS_NAME environment variable to tune a specified operator layer.
- The Auto Tune tool also provides other environment variable functions. For details, see Environment Variable Configuration.
- It is allowed to start multiple graph building processes for tuning on the host. Proper process parallelism improves tuning efficiency. However, due to resource restrictions, tuning efficiency decreases once the number of parallel processes exceeds a certain limit. The following condition should be met:
Graph building process count x TE_PARALLEL_COMPILER x 2 < Host CPU core count
TE_PARALLEL_COMPILER indicates the number of parallel operator build processes.
In the TBE operator parallel build scenario (that is, TE_PARALLEL_COMPILER > 1), it is recommended that one graph building process correspond to one device.
Offline Tuning Based on Dump Data
- Obtain the dump data (including the operator output description file and operator binary file).
- Run the tool to perform offline tuning with the obtained dump data.
The detailed operations are as follows:
- Obtain the dump data.
The dump data refers to the operator output description file and the operator binary file. The prerequisites for generating the dump data include:
- Configure dump-related environment variables:
- LD_LIBRARY_PATH, PYTHONPATH, and ASCEND_OPP_PATH are required environment variables for configuring Auto Tune. For details, see Environment Variable Configuration.
export install_path=/home/HwHiAiUser/Ascend/ascend-toolkit/latest
export LD_LIBRARY_PATH=${install_path}/acllib/lib64:${install_path}/atc/lib64:$LD_LIBRARY_PATH
export PYTHONPATH=${install_path}/atc/python/site-packages:${install_path}/atc/python/site-packages/auto_tune.egg/auto_tune:${install_path}/atc/python/site-packages/schedule_search.egg:$PYTHONPATH
export ASCEND_OPP_PATH=${install_path}/opp
- Enable dump.
export ENABLE_TUNE_DUMP=True
- Set the dump path.
export TUNE_DUMP_PATH=/home/HwHiAiUser/DumpData
- Use the ATC tool to convert the model or build an IR model (the Auto Tune tool does not need to be enabled) to generate dump data.
- For details about how to use the ATC tool, see ATC Tool Instructions.
- For details about how to build an IR model, see IR Model Building Guide.
After model conversion is complete, dump data is generated in the path specified by TUNE_DUMP_PATH, as in the sketch below.
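A minimal end-to-end sketch of this step, reusing the example model and dump path from this section (both are illustrative):
# Enable dump and set the dump path (Auto Tune itself is not enabled here)
export ENABLE_TUNE_DUMP=True
export TUNE_DUMP_PATH=/home/HwHiAiUser/DumpData
atc --model=./tune.pb --framework=3 --output=./add_tune --soc_version=Ascend310
ls /home/HwHiAiUser/DumpData # the dump data used later for offline tuning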
- Perform offline tuning based on the dump data with Auto Tune.
The entry script for offline tuning is in the ATC installation path: atc/python/site-packages/schedule_search.egg/schedule_search/msoptune.py. Run the following command to start offline tuning by executing this Python file:
python3.7 {msoptune.py path} --start {dump_path}
Example:
python3.7 /home/HwHiAiUser/Ascend/atc/python/site-packages/schedule_search.egg/schedule_search/msoptune.py --start /home/HwHiAiUser/DumpData
- /home/HwHiAiUser/Ascend: ATC installation path. Replace it as required.
- /home/HwHiAiUser/DumpData: dump path, either an absolute path or a path relative to the directory where the current script is executed.
Currently, only one process is allowed for offline tuning on the host.
Tuning Result
This topic describes the changes of the repositories and the tuning result file after tuning is complete.
Custom Repository
- For Operators Supporting RL Tuning, the custom repository is stored in the path specified by the TUNE_BANK_PATH environment variable. If this environment variable is not set, the custom repository is stored in the atc/data/rl/<soc_version>/custom/ directory in the ATC installation path by default.
- For Operators Supporting GA Tuning, the custom repository is stored in the atc/data/tiling/<soc_version>/custom/ directory in the ATC installation path.
Tuning Result File
When the tuning starts, a file named tune_result_pidxxx_{timestamp}.json is generated in the Auto Tune working directory, which records the tuning process and result.
- process_data is structured as follows:
"[['Operator Name']]":{"best_ticks":[[82, "2020-08-08 18:03:38"], [104, "2020-08-08 18:03:50"],...}
Specifically:
- Operator Name: name of the operator in the original graph. If graph fusion is performed during the tuning and the fused node comes from multiple nodes in the original graph, multiple operator names are displayed, for example: [['scale5a_branch1', 'bn5a_branch1', 'res5a_branch1'], ['res5a'], ['res5a_relu']]
- best_ticks: records the operator tuning result of each iteration, including the tiling elapsed time and the tuning end time.
- result_data is structured as follows:
"[['Operator Name']]": {"before_tune": 66, "after_tune": 56}
Specifically:
- Operator Name: name of the operator in the original graph. If graph fusion is performed during the tuning and the fused node comes from multiple nodes in the original graph, multiple operator names are displayed, for example: [['scale5a_branch1', 'bn5a_branch1', 'res5a_branch1'], ['res5a'], ['res5a_relu']]
- before_tune: time (μs) taken to execute the operator before Auto Tune is performed.
- after_tune: time (μs) taken to execute the operator after Auto Tune is performed.
During the tuning, a tune_show_pidxxx_{timestamp} folder is generated in the tuning working directory, which stores the flag file of each tuned operator. If you want to cancel the tuning of an operator, run the following command:
python3.7 /home/HwHiAiUser/Ascend/atc/python/site-packages/schedule_search.egg/schedule_search/msoptune.py --stop tune_show_pidxxx_{timestamp}
Select the operator whose tuning you want to cancel as prompted.
When tuning is complete, the tool compares the current repository with the existing repository. If the current repository outperforms the existing one, it replaces the existing repository in the custom directory or is added to the custom directory as a new repository. Otherwise, no new repository is generated.
The tune_show_pidxxx_{timestamp} folder is automatically deleted after the tuning is complete.