Preparing Dump Data of a Model Running on the Ascend AI Processor
Prerequisites
To dump data of a migrated training network, ensure that the development environment has been set up by referring to CANN Software Installation Guide, and an executable training project is available after the training model is developed, built, and executed.
If the training network contains random factors, remove them before dumping.
Currently, only AI CPU and AI Core operators can be dumped. Other operators, such as Huawei Collective Communication Library (HCCL) operators, cannot be dumped.
Generating Dump Data
Perform the following steps to generate dump data of a training network model:
- Modify the script to enable the dump function. Add the following configuration at the corresponding positions in the script.

  In Estimator mode, collect dump data using dump_config in NPURunConfig. Before NPURunConfig is created, instantiate a DumpConfig class for dump configuration, including the dump path, the iterations to dump, and whether to dump the input or output of each operator. For details about the DumpConfig class, see Network Model Porting and Training Guide.
```python
from npu_bridge.estimator.npu.npu_config import NPURunConfig
from npu_bridge.estimator.npu.npu_config import DumpConfig

# dump_path: dump path. Create the specified path in advance in the training
#   environment (either in a container or on the host). The running user
#   configured during installation must have read and write permissions on it.
# enable_dump: dump enable
# dump_step: iterations to dump
# dump_mode: dump mode, which can be set to input, output, or all
dump_config = DumpConfig(enable_dump=True,
                         dump_path="/home/HwHiAiUser/output",
                         dump_step="0|5|10",
                         dump_mode="all")

session_config = tf.ConfigProto()
config = NPURunConfig(
    dump_config=dump_config,
    session_config=session_config
)
```
In session.run mode, configure the dump function through the session configuration items enable_dump, dump_path, dump_step, and dump_mode.
```python
# RewriterConfig is required to disable remapping below.
from tensorflow.core.protobuf.rewriter_config_pb2 import RewriterConfig

config = tf.ConfigProto()
custom_op = config.graph_options.rewrite_options.custom_optimizers.add()
custom_op.name = "NpuOptimizer"
custom_op.parameter_map["use_off_line"].b = True
# enable_dump: dump enable
custom_op.parameter_map["enable_dump"].b = True
# dump_path: dump path. Create the specified path in advance in the training
#   environment (either in a container or on the host). The running user
#   configured during installation must have read and write permissions on it.
custom_op.parameter_map["dump_path"].s = tf.compat.as_bytes("/home/HwHiAiUser/output")
# dump_step: iterations to dump
custom_op.parameter_map["dump_step"].s = tf.compat.as_bytes("0|5|10")
# dump_mode: dump mode, which can be set to input, output, or all
custom_op.parameter_map["dump_mode"].s = tf.compat.as_bytes("all")
config.graph_options.rewrite_options.remapping = RewriterConfig.OFF

with tf.Session(config=config) as sess:
    print(sess.run(cost))
```
Table 5-1 Parameter description

| Parameter | Description |
| --- | --- |
| enable_dump | Data dump enable.<br>- True: enabled. The dump file path is read from dump_path.<br>- False (default): disabled. |
| dump_path | Dump path. Required when enable_dump is set to True.<br>The specified path must be created in advance in the environment (either in a container or on the host) where training is performed. The running user configured during installation must have read and write permissions on this path. The path can be an absolute path or a path relative to the directory where the command is executed.<br>- An absolute path starts with a slash (/), for example, /home/HwHiAiUser/output.<br>- A relative path starts with a directory name, for example, output. |
| dump_step | Iterations to dump. Defaults to None, indicating that all iterations are dumped.<br>Separate multiple iterations with vertical bars (\|), for example, 0\|5\|10. You can also use hyphens (-) to specify an iteration range, for example, 0\|3-5\|10. |
| dump_mode | Dump mode.<br>- input: dumps only operator inputs.<br>- output (default): dumps only operator outputs.<br>- all: dumps both operator inputs and outputs. |
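The dump_step syntax can be sketched as a small parser: for example, 0|3-5|10 expands to iterations 0, 3, 4, 5, and 10. The helper below is illustrative only and is not part of the CANN API:

```python
def parse_dump_step(dump_step):
    """Expand a dump_step string such as "0|3-5|10" into a sorted list of
    iteration numbers. Entries are separated by vertical bars (|); a hyphen
    denotes an inclusive range."""
    iterations = set()
    for entry in dump_step.split("|"):
        if "-" in entry:
            start, end = entry.split("-")
            iterations.update(range(int(start), int(end) + 1))
        else:
            iterations.add(int(entry))
    return sorted(iterations)

print(parse_dump_step("0|3-5|10"))  # [0, 3, 4, 5, 10]
```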
- Run the training script and generate the dump data.
Dump data is generated to the {dump_path}/{time}/{deviceid}/{model_name}/{model_id}/{data_index} directory. For example, if {dump_path} is set to /home/HwHiAiUser/output, dump data is generated to the /home/HwHiAiUser/output/20200808163566/0/ge_default_20200808163719_121/11/0 directory.
The format of the dump path and file name is described as follows:
- dump_path: path set for dump_path in Step 1. If a relative path is set, the corresponding absolute path applies.
- time: dump time, formatted as YYYYMMDDhhmmss.
- deviceid: device ID.
- model_name: submodel name. If the model_name directory contains more than one folder, dump data in the folder of the computational graph is used.
After the training script is executed, one or more GE graphs are generated in the directory of the training script, named in the form ge_proto_*****_Build.txt. The GE graph whose name field contains IteratorV2, Iterator, or GetNext is the computational graph. The value of the name field in the computational graph is used as the name of the computational graph.
- model_id: submodel ID.
- data_index: iterations to dump. If dump_step is specified, data_index and dump_step are the same. If not, data_index is indexed starting at 0 and is incremented by 1 with each dump.
- dump_file: formatted as {op_type}.{op_name}.{taskid}.{timestamp}.
Periods (.), forward slashes (/), backslashes (\), and spaces in model_name, op_type, or op_name are replaced with underscores (_).
- Dump data is generated in each iteration. A large training dataset generates a large volume of dump data (dozens of GB or more). You are advised to limit the number of dump iterations to one.
- In the training scenario using multiple Ascend AI Processors, because the processes defined in the training script do not start at the same time, multiple timestamp directories are generated when data is dumped.
- When the command is executed in a Docker container, the generated data is stored in the container.
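As a sketch of the naming rule above, the following hypothetical helper assembles a dump file name of the form {op_type}.{op_name}.{taskid}.{timestamp}, replacing periods, forward slashes, backslashes, and spaces in op_type and op_name with underscores. The helper is for illustration only and is not part of the CANN toolchain:

```python
import re

def sanitize(field):
    """Replace periods, forward slashes, backslashes, and spaces with
    underscores, as is done for model_name, op_type, and op_name."""
    return re.sub(r"[./\\ ]", "_", field)

def dump_file_name(op_type, op_name, task_id, timestamp):
    """Assemble a dump file name: {op_type}.{op_name}.{taskid}.{timestamp}."""
    return "{}.{}.{}.{}".format(sanitize(op_type), sanitize(op_name),
                                task_id, timestamp)

print(dump_file_name("Conv2D", "conv1/Conv2D", 14, 161243563374152))
# Conv2D.conv1_Conv2D.14.161243563374152
```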