Data Dump
Overview
The system supports dumping operator data during training. You can then use the Model Accuracy Analyzer to analyze operator accuracy. Currently, the following dump modes are supported:
- input: dumps only operator inputs.
- output: dumps only operator outputs.
- all: dumps both operator inputs and outputs.
Dump is disabled by default. To enable it during training, modify the training script as described in the following sections.
Keep the following points in mind for data dump:
- Any iteration can be dumped, and you can specify which iterations to dump. If the training dataset is large, the dump data of a single iteration can reach dozens of GB or more, so you are advised to limit the number of dumped iterations.
- The data dump and debug dump functions are mutually exclusive.
- Currently, only dump data of AI Core and AI CPU operators can be collected. Dump data of collective communication operators cannot be collected.
Collecting Dump Data with Estimator
In Estimator mode, use dump_config in NPURunConfig to collect dump data. Before creating NPURunConfig, instantiate the DumpConfig class to configure the dump, including the dump path, the iterations to dump, and whether to dump operator inputs or outputs.
For details about each field in the constructor of the DumpConfig class, see the description of the corresponding API.
import tensorflow as tf
from npu_bridge.estimator.npu.npu_config import NPURunConfig
from npu_bridge.estimator.npu.npu_config import DumpConfig

# dump_path: dump path. Create the specified path in advance in the training
# environment (either in a container or on the host). The running user configured
# during installation must have read and write permissions on this path.
# enable_dump: whether to enable dump
# dump_step: iterations to dump
# dump_mode: dump mode, which can be set to input, output, or all
dump_config = DumpConfig(enable_dump=True,
                         dump_path="/home/HwHiAiUser/output",
                         dump_step="0|5|10",
                         dump_mode="all")

session_config = tf.ConfigProto()
config = NPURunConfig(
    dump_config=dump_config,
    session_config=session_config
)
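For reference, the following minimal sketch shows how the resulting NPURunConfig might be passed to an estimator. The NPUEstimator import path matches the npu_bridge package used above, but model_fn, input_fn, the model_dir value, and the step count are placeholders for your own training code.

from npu_bridge.estimator.npu.npu_estimator import NPUEstimator

# model_fn and input_fn are placeholders for your own Estimator model function
# and input function; max_steps=11 is illustrative so that iterations 0, 5,
# and 10 selected by dump_step are actually executed.
estimator = NPUEstimator(model_fn=model_fn, config=config, model_dir="/tmp/model")
estimator.train(input_fn=input_fn, max_steps=11)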
Collecting Dump Data with sess.run()
In sess.run() mode, configure the dump through the session configuration options enable_dump, dump_path, dump_step, and dump_mode.
For details about the preceding parameters, see the description of the corresponding API.
import tensorflow as tf
from tensorflow.core.protobuf.rewriter_config_pb2 import RewriterConfig

config = tf.ConfigProto()
custom_op = config.graph_options.rewrite_options.custom_optimizers.add()
custom_op.name = "NpuOptimizer"
custom_op.parameter_map["use_off_line"].b = True
# enable_dump: whether to enable dump
custom_op.parameter_map["enable_dump"].b = True
# dump_path: dump path. Create the specified path in advance in the training
# environment (either in a container or on the host). The running user configured
# during installation must have read and write permissions on this path.
custom_op.parameter_map["dump_path"].s = tf.compat.as_bytes("/home/HwHiAiUser/output")
# dump_step: iterations to dump
custom_op.parameter_map["dump_step"].s = tf.compat.as_bytes("0|5|10")
# dump_mode: dump mode, which can be set to input, output, or all
custom_op.parameter_map["dump_mode"].s = tf.compat.as_bytes("all")
config.graph_options.rewrite_options.remapping = RewriterConfig.OFF

with tf.Session(config=config) as sess:
    # cost: a tensor defined earlier in your training graph
    print(sess.run(cost))
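As a usage sketch, the toy graph below (x, w, and cost are illustrative stand-ins for your own model; config is the session configuration built above) runs enough iterations that the steps listed in dump_step are actually reached:

# Hypothetical toy graph; replace x, w, and cost with your own model.
x = tf.constant([[1.0, 2.0]])
w = tf.Variable([[0.5], [0.5]])
cost = tf.reduce_sum(tf.matmul(x, w))

with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())
    # Runs iterations 0-10; dump_step "0|5|10" selects iterations 0, 5, and 10.
    for step in range(11):
        sess.run(cost)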
Viewing Dump Data
If dump data is collected during training, dump files are generated in the {dump_path}/{time}/{deviceid}/{model_name}/{model_id}/{data_index} directory, for example, /home/HwHiAiUser/output/20200808163566/0/ge_default_20200808163719_121/11/0. In addition, GE graph files, for example, ge_proto_xxxxx_Build.txt, are generated in the same directory as the training script.
The dump data path and file naming rules are as follows:
- dump_path: dump path, for example, /home/HwHiAiUser/output.
- time: timestamp, for example, 20200317020343.
- deviceid: device ID.
- model_name: submodel name. If more than one folder exists at the model_name level, use the dump data in the folder named after the computational graph.
After the training script is executed, one or more GE graph files, named in the format ge_proto_*****_Build.txt, are generated in the directory of the training script. To identify the computational graph, open each GE graph file: the graph that contains the IteratorV2, Iterator, or GetNext operator is the computational graph. The value of the name field in that graph is the name of the computational graph.
- model_id: submodel ID
- data_index: dumped iteration. If dump_step is specified, data_index equals the corresponding dump_step value. Otherwise, data_index starts at 0 and increments by 1 with each dump.
- dump_file: formatted as {op_type}.{op_name}.{taskid}.{timestamp}
- Periods (.), forward slashes (/), backslashes (\), and spaces in model_name, op_type, or op_name are replaced with underscores (_).
- In multi-device training scenarios where more than one Ascend AI Processor is used, the processes defined in the training script do not start at exactly the same time, so multiple timestamp directories are generated.
- When the command is executed in a Docker container, the generated data is stored in the container.
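For reference, the following sketch walks a dump directory according to the layout above and lists dump files by iteration. The dump_path value and the find_dump_files helper name are illustrative only.

import os
import re

# Expected file name format: {op_type}.{op_name}.{taskid}.{timestamp}. Because
# periods inside names are replaced with underscores, splitting on periods is safe.
DUMP_FILE_RE = re.compile(
    r"^(?P<op_type>[^.]+)\.(?P<op_name>[^.]+)\.(?P<taskid>\d+)\.(?P<timestamp>\d+)$")

def find_dump_files(dump_path):
    """Yield (data_index, op_type, op_name) for every dump file under dump_path."""
    for time_dir in sorted(os.listdir(dump_path)):  # {time} directories
        # Walk {deviceid}/{model_name}/{model_id}/{data_index} below each timestamp.
        for root, _, files in os.walk(os.path.join(dump_path, time_dir)):
            for name in files:
                match = DUMP_FILE_RE.match(name)
                if match:
                    data_index = os.path.basename(root)  # last path component
                    yield data_index, match.group("op_type"), match.group("op_name")

for data_index, op_type, op_name in find_dump_files("/home/HwHiAiUser/output"):
    print(data_index, op_type, op_name)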
Analyzing Dump Data
Use the Model Accuracy Analyzer to analyze the operator accuracy. For details, see Model Accuracy Analyzer Instructions.