Overflow Detection
Application Scenario
For a large network, a large amount of data is dumped during operator accuracy analysis. In addition, because of randomness in the network, it is difficult to locate the operators whose accuracy drops compared with the corresponding operators of a third-party framework. In this case, you can enable the overflow detection function. Currently, the following three overflow detection modes are provided:
- aicore_overflow: detects AI Core operator overflow. An overflow is reported when the operator inputs are normal but the outputs are abnormal extreme values (for example, float16 values such as 65500, 38400, or 51200). Once such a fault is detected, analyze the cause of the overflow and modify the operator implementation based on the network requirements and operator logic.
- atomic_overflow: detects Atomic Add overflow. An Atomic Add overflow is detected when data is moved from the Unified Buffer (UB) to external storage after AI Core computation.
- all: detects both AI Core operator overflow and Atomic Add overflow.
Based on the overflow detection result, locate the faulty operators, dump data of the specific faulty operators, and analyze the dump data to solve the accuracy drop issue.
Data dump and overflow detection are mutually exclusive.
Detecting Overflow with Estimator
In Estimator mode, use dump_config in NPURunConfig to set the overflow detection mode. Before creating NPURunConfig, create a DumpConfig instance.
import tensorflow as tf
from npu_bridge.estimator.npu.npu_config import NPURunConfig
from npu_bridge.estimator.npu.npu_config import DumpConfig

# dump_path: dump path. Create the specified path in advance in the training environment
# (either in a container or on the host). The running user configured during installation
# must have read and write permissions on this path.
# enable_dump_debug: whether to enable overflow detection
# dump_debug_mode: overflow detection mode, which can be all, aicore_overflow, or atomic_overflow
dump_config = DumpConfig(enable_dump_debug=True, dump_path="/home/HwHiAiUser/output", dump_debug_mode="all")

session_config = tf.ConfigProto()
config = NPURunConfig(dump_config=dump_config, session_config=session_config)
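The resulting NPURunConfig is then passed to the Estimator for training in the usual way. The following is a minimal sketch, assuming the NPUEstimator class from npu_bridge (verify the import path against your installed version); model_fn, input_fn, and the model_dir path are hypothetical placeholders for the user's own model function, input pipeline, and output directory:

from npu_bridge.estimator.npu.npu_estimator import NPUEstimator

# model_fn and input_fn are placeholders for the user's own model function and input pipeline.
estimator = NPUEstimator(model_fn=model_fn, config=config, model_dir="/home/HwHiAiUser/model")
estimator.train(input_fn=input_fn, max_steps=1000)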
Detecting Overflow with sess.run()
In sess.run mode, enable overflow detection through the session configuration options dump_path, enable_dump_debug, and dump_debug_mode.
import tensorflow as tf
from tensorflow.core.protobuf.rewriter_config_pb2 import RewriterConfig

config = tf.ConfigProto()
custom_op = config.graph_options.rewrite_options.custom_optimizers.add()
custom_op.name = "NpuOptimizer"
custom_op.parameter_map["use_off_line"].b = True
# dump_path: dump path. Create the specified path in advance in the training environment
# (either in a container or on the host). The running user configured during installation
# must have read and write permissions on this path.
custom_op.parameter_map["dump_path"].s = tf.compat.as_bytes("/home/HwHiAiUser/output")
# enable_dump_debug: whether to enable overflow detection
custom_op.parameter_map["enable_dump_debug"].b = True
# dump_debug_mode: overflow detection mode, which can be all, aicore_overflow, or atomic_overflow
custom_op.parameter_map["dump_debug_mode"].s = tf.compat.as_bytes("all")
config.graph_options.rewrite_options.remapping = RewriterConfig.OFF  # Disable remapping.
with tf.Session(config=config) as sess:
    print(sess.run(cost))  # cost: a loss tensor defined by the user's model
Viewing Overflowed Data
If overflowed data is collected during training, an overflowed data file is generated in the {dump_path}/{time}/{deviceid}/{model_id}/{data_index} directory, for example, /home/HwHiAiUser/output/20200808163566/0/11/0.
If no overflowed data is collected during training, that is, no overflow occurs, the preceding directory is not generated.
The fields in the dump data path and file name are described as follows (a directory-walking sketch follows the list):
- dump_path: user-defined path for storing overflowed data, for example, /home/HwHiAiUser/output.
- time: timestamp, for example, 20200808163566.
- deviceid: device ID.
- model_id: subnetwork ID.
- data_index: iteration in which the overflow is detected.
- dump_file: named in the format Opdebug.Node_OpDebug.{taskid}.{timestamp}. Note that this taskid is not the task ID of the operator.
- Periods (.), forward slashes (/), backslashes (\), and spaces in model_name, op_type, or op_name are replaced by underscores (_).
- In the multi-device training scenario where more than one Ascend AI Processor is used, the processes defined in the training script do not start at exactly the same time, so multiple timestamp directories are generated.
- When the command is executed in a Docker container, the generated data is stored in the container.
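To illustrate how the preceding fields combine, the following is a minimal sketch (the dump_path value is an assumption) that walks the dump directory described above and prints every overflow file found, grouped by timestamp, device, model, and iteration:

import os

dump_path = "/home/HwHiAiUser/output"  # user-defined dump path (assumed value)

# Layout: {dump_path}/{time}/{deviceid}/{model_id}/{data_index}/{dump_file}
for time_dir in sorted(os.listdir(dump_path)):
    for device_id in sorted(os.listdir(os.path.join(dump_path, time_dir))):
        for model_id in sorted(os.listdir(os.path.join(dump_path, time_dir, device_id))):
            model_path = os.path.join(dump_path, time_dir, device_id, model_id)
            for data_index in sorted(os.listdir(model_path)):
                for dump_file in os.listdir(os.path.join(model_path, data_index)):
                    print(time_dir, device_id, model_id, data_index, dump_file)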
Parsing Overflowed Data
Since the generated overflowed data is in binary format, you need to convert the binary file into a readable form, such as a .json file.
- Upload the overflow data file Opdebug.Node_OpDebug.{taskid}.{timestamp} to the Toolkit installation environment.
You are advised to go to the data_index directory with the smallest value and use the dump file with the smallest {timestamp} for data parsing (see the sketch below).
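For example, a minimal sketch that picks such a file, assuming a data_index directory path (the path value is an assumption) and the Opdebug.Node_OpDebug.{taskid}.{timestamp} naming convention described above:

import os

index_path = "/home/HwHiAiUser/output/20200808163566/0/11/0"  # assumed data_index directory

# The file name ends with the timestamp, so take the file whose trailing field is smallest.
opdebug_files = [f for f in os.listdir(index_path) if "OpDebug" in f]
earliest = min(opdebug_files, key=lambda name: int(name.rsplit(".", 1)[-1]))
print(os.path.join(index_path, earliest))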
- Go to the directory where the parse script is stored. The following uses /home/HwHiAiUser/Ascend/ascend-toolkit/latest as the Toolkit installation path as an example:
cd /home/HwHiAiUser/Ascend/ascend-toolkit/latest/toolkit/tools/operator_cmp/compare
- Run the parse command:
python3.7.5 msaccucmp.pyc convert -d /home/HwHiAiUser/opdebug/Opdebug.Node_OpDebug.59.1597922031178434 -out /home/HwHiAiUser/result
The key options are described as follows:
- -d: path of the overflowed data file, including the file name
- -out: output directory for the parsed result. If not specified, the current directory is used.
- View the parsed result. The following is an example:
{ "DHA Atomic Add": { "model_id": 0, "stream_id": 0, "task_id": 0, "task_type": 0, "pc_start": "0x0", "para_base": "0x0", "status": 0 }, "L2 Atomic Add": { "model_id": 0, "stream_id": 0, "task_id": 0, "task_type": 0, "pc_start": "0x0", "para_base": "0x0", "status": 0 }, "AI Core": { "model_id": 514, "stream_id": 563, "task_id": 57, "task_type": 0, "pc_start": "0x1008005b0000", "para_base": "0x100800297000", "kernel_code": "0x1008005ae000", "block_idx": 1, "status": 32 } }
If both AI Core operator overflow detection and Atomic Add overflow detection are enabled, only the earliest overflow record is displayed.
In the preceding example, the earliest overflow record is an AI Core operator overflow.
Parameter description:
- model_id: ID of the model where the overflow operator is located.
- stream_id: ID of the stream where the overflow operator is located.
- task_id: task ID of the overflow operator.
- task_type: task type of the overflow operator.
- pc_start: start address of the program code of the overflow operator.
- para_base: start address of the parameters of the overflow operator.
- kernel_code: start address of the program code of the overflow operator, which is equivalent to pc_start.
- block_idx: block ID of the overflow operator.
- status: status of the AI Core status register, including the overflow information.
Based on the stream_id and task_id fields, you can locate the model containing the overflow operator from the Runtime INFO-level log.
In addition, you can locate the block where overflow occurs based on block_idx and obtain the cause from status.
Status Reference
- The status field that reflects the AI Core operator overflow detection result is in decimal format. You need to convert it into hexadecimal format before locating the fault. (A combined decoding sketch for all three status fields follows at the end of this section.)
For example, assume that the value of status is 272. Its hexadecimal equivalent is 0x00000110, so the error causes are 0x00000010 and 0x00000100.
- 0x00000008: negation overflow of the minimum negative number of a signed integer
- 0x00000010: integer addition, subtraction, multiplication, or multiply-add overflow
- 0x00000020: floating-point computation overflow
- 0x00000080: negative input for floating-point to unsigned conversion
- 0x00000100: FP32 to FP16 conversion or 32-bit signed integer to FP16 conversion overflow
- 0x00000400: Cube accumulation overflow
Note: Each of the preceding error causes corresponds to a separate bit, so a single status value may combine several causes, as in the 272 example above.
- The status field that reflects the DHA Atomic Add overflow detection result is in decimal format. You need to convert it into a binary number and convert bits 8 to 15 to a hexadecimal number before locating the fault.
For example, assume that the value of status is 2546, which is 100111110010 in binary. Bits 8 to 15 are 1001 (the higher bits are 0), that is, hexadecimal 0x9, which indicates the error cause.
- 0x9: atomic overflow
- 0xA: atomic underflow
- 0xB: atomic srcnan (invalid source operand)
- 0xC: atomic dstnan (invalid destination operand)
- 0xD: atomic bothnan (invalid source operand and destination operand)
- The status field that reflects the L2 Atomic Add overflow detection result is in decimal format. You need to convert it into a binary number and check bits 16 to 18 before locating the fault.
For example, assume that the value of status is 2546, which is 100111110010 in binary. Bits 16 to 18 are 000, so no error has occurred.
- 001: atomic overflow
- 010: atomic underflow
- 011: atomic srcnan (invalid source operand)
- 100: atomic dstnan (invalid destination operand)
- 101: atomic bothnan (invalid source operand and destination operand)
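Putting the three status interpretations above together, the following is a minimal decoding sketch; the bit positions and cause tables are taken directly from the lists above, and the helper function names are illustrative only:

# Decode the status field of an overflow record according to the rules above.

AI_CORE_CAUSES = {
    0x00000008: "negation overflow of the minimum negative number of a signed integer",
    0x00000010: "integer addition, subtraction, multiplication, or multiply-add overflow",
    0x00000020: "floating-point computation overflow",
    0x00000080: "negative input for floating-point to unsigned conversion",
    0x00000100: "FP32 to FP16 or 32-bit signed integer to FP16 conversion overflow",
    0x00000400: "Cube accumulation overflow",
}

DHA_ATOMIC_CAUSES = {
    0x9: "atomic overflow",
    0xA: "atomic underflow",
    0xB: "atomic srcnan (invalid source operand)",
    0xC: "atomic dstnan (invalid destination operand)",
    0xD: "atomic bothnan (invalid source and destination operands)",
}

L2_ATOMIC_CAUSES = {
    0b001: "atomic overflow",
    0b010: "atomic underflow",
    0b011: "atomic srcnan (invalid source operand)",
    0b100: "atomic dstnan (invalid destination operand)",
    0b101: "atomic bothnan (invalid source and destination operands)",
}

def decode_ai_core(status):
    # Each cause corresponds to one bit, so several causes can be combined in one value.
    return [cause for bit, cause in AI_CORE_CAUSES.items() if status & bit]

def decode_dha_atomic(status):
    # The cause code is carried in bits 8 to 15.
    return DHA_ATOMIC_CAUSES.get((status >> 8) & 0xFF, "no error")

def decode_l2_atomic(status):
    # The cause code is carried in bits 16 to 18.
    return L2_ATOMIC_CAUSES.get((status >> 16) & 0b111, "no error")

print(decode_ai_core(272))      # 272 = 0x110: causes 0x00000010 and 0x00000100
print(decode_dha_atomic(2546))  # bits 8-15 are 0x9: atomic overflow
print(decode_l2_atomic(2546))   # bits 16-18 are 000: no error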