Preparing Data
- If an AI Core error occurs during training, perform the following steps to configure the op_debug_level and enable_exception_dump parameters:
- In Estimator mode, set op_debug_level and enable_exception_dump as follows.
    import tensorflow as tf
    from npu_bridge.estimator.npu.npu_config import NPURunConfig
    from npu_bridge.estimator.npu.npu_config import DumpConfig

    session_config = tf.ConfigProto()
    config = NPURunConfig(
        op_debug_level=2,  # Enable operator debug.
        session_config=session_config,
        # Dump the inputs and outputs of the error operator to the script
        # execution directory. Dynamic-shape operators cannot be dumped.
        enable_exception_dump=1,
    )
- In sess.run mode, set op_debug_level and enable_exception_dump as follows.
    import tensorflow as tf
    from npu_bridge.estimator import npu_ops
    from tensorflow.core.protobuf.rewriter_config_pb2 import RewriterConfig

    config = tf.ConfigProto()
    custom_op = config.graph_options.rewrite_options.custom_optimizers.add()
    custom_op.name = "NpuOptimizer"
    custom_op.parameter_map["use_off_line"].b = True
    # Dump the inputs and outputs of the error operator to the script
    # execution directory. Dynamic-shape operators cannot be dumped.
    custom_op.parameter_map["enable_exception_dump"].i = 1
    custom_op.parameter_map["op_debug_level"].i = 2  # Enable operator debug.
    config.graph_options.rewrite_options.remapping = RewriterConfig.OFF  # Disable remapping.

    # cost is the tensor defined in your training script.
    with tf.Session(config=config) as sess:
        print(sess.run(cost))
Table 6-1 Value range of the op_debug_level parameter

| Value | Description |
| --- | --- |
| 0 (default) | Disables operator debug. |
| 1 | Enables operator debug and generates a TBE instruction mapping file. An operator CCE file (*.cce), a Python-CCE mapping file (*_loc.json), and operator .o and .json files are generated in the kernel_meta folder in the training script execution directory. You can locate the AI Core error by using the line numbers in the CCE code and TBE code of the error operator. |
| 2 | Same as 1, and additionally disables build optimization by enabling the CCE compiler options -O0 -g. The same files are generated in the kernel_meta folder in the training script execution directory, and you can locate the AI Core error by using the line numbers in the CCE code and TBE code of the error operator. |
Before performing training again, ensure that the /var/log/npu/slog directory contains only logs of the current project. Otherwise, data parsing fails. Back up important logs in advance.
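The backup step in the note above could be scripted roughly as follows. This is a minimal sketch, not part of the toolkit: the helper name back_up_logs and the backup location are hypothetical, and it assumes that no training job is running and that the log directory may safely be emptied while the logging service is idle.

```python
import shutil
import time
from pathlib import Path


def back_up_logs(log_dir, backup_root):
    """Move every entry in log_dir into a new timestamped folder under backup_root.

    Hypothetical helper: returns the backup folder and the names that were moved.
    """
    log_dir = Path(log_dir)
    backup_dir = Path(backup_root) / time.strftime("slog_backup_%Y%m%d_%H%M%S")
    # Fail loudly rather than mix two backups in the same folder.
    backup_dir.mkdir(parents=True, exist_ok=False)
    moved = []
    for entry in sorted(log_dir.iterdir()):
        shutil.move(str(entry), str(backup_dir / entry.name))
        moved.append(entry.name)
    return backup_dir, moved


# Example call (the backup root is an assumed path; adjust to your environment):
# back_up_logs("/var/log/npu/slog", "/path/to/log_backups")
```

Moving rather than copying leaves the log directory empty, so the next training run writes only logs of the current project there.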
- Locate the instruction mapping file and the error operator dump file generated in the training script execution directory.
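As a rough aid for this step, the sketch below lists the files that Table 6-1 says are written to the kernel_meta folder. The function name find_debug_artifacts is hypothetical. The naming of the exception dump file is not specified in this section, so the sketch does not try to match it; look for it directly in the script execution directory.

```python
from pathlib import Path

# File patterns taken from Table 6-1; the dictionary keys are illustrative labels.
PATTERNS = {
    "cce": "*.cce",           # operator CCE files
    "mapping": "*_loc.json",  # Python-CCE mapping files
    "object": "*.o",          # operator object files
    "json": "*.json",         # operator .json files
}


def find_debug_artifacts(script_dir):
    """Return the debug files found in <script_dir>/kernel_meta, grouped by kind."""
    kernel_meta = Path(script_dir) / "kernel_meta"
    if not kernel_meta.is_dir():
        # Nothing was generated (for example, op_debug_level was left at 0).
        return {kind: [] for kind in PATTERNS}
    found = {
        kind: sorted(p.name for p in kernel_meta.glob(pattern))
        for kind, pattern in PATTERNS.items()
    }
    # "*.json" also matches the mapping files, so filter those out of "json".
    found["json"] = [n for n in found["json"] if not n.endswith("_loc.json")]
    return found
```

Calling find_debug_artifacts(".") from the training script execution directory returns a dictionary of file names; an empty "mapping" list suggests that operator debug was not enabled for the failing run.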