Session Configuration in sess.run Mode
When a training script runs in sess.run mode on the Ascend AI Processor, the following configuration options are supported.
Configuration Option | Description |
---|---|
use_off_line | Whether training is performed on the Ascend AI Processor. |
enable_data_pre_proc | Whether to enable data preprocessing. |
iterations_per_loop | Number of iterations per training loop performed on the device side per sess.run() call, set by using set_iteration_per_loop. For function verification, the value must be the same as the value passed to set_iteration_per_loop. |
profiling_mode | Whether to enable profiling. |
profiling_options | Profiling option (or options separated by colons) to be traced. |
fp_point | Required if training_trace is selected. Specifies the start operator of forward propagation on the training network, used to record the start timestamp of forward propagation in the iteration trace. Set the value to the name of the top operator in forward propagation. To obtain the name, save the graph as a .pbtxt file by calling tf.io.write_graph in the training script. |
bp_point | Required if training_trace is selected. Specifies the end operator of backward propagation on the training network, used to record the end timestamp of backward propagation in the iteration trace. Together, fp_point and bp_point are used to calculate the forward and backward time. Set the value to the name of the bottom operator in backward propagation. To obtain the name, save the graph as a .pbtxt file by calling tf.io.write_graph in the training script. |
enable_dump | Whether to enable data dump. |
dump_path | Dump path. Required when enable_dump or enable_dump_debug is set to True. Create the specified path in advance in the environment (either in a container or on the host) where training is performed; the running user configured during installation must have read and write permissions on it. The path can be absolute or relative to the directory where the training script is executed. |
dump_step | Iterations to dump. Defaults to None, indicating that all iterations are dumped. Separate multiple iterations with vertical bars, for example 0\|5\|10. A hyphen specifies an iteration range, for example 0\|3-5\|10. |
dump_mode | Dump mode. |
enable_dump_debug | Whether to enable overflow detection. |
dump_debug_mode | Overflow detection mode. |
precision_mode | A string specifying the operator precision mode. |
enable_reduce_precision | Not supported in the current version. |
variable_format_optimize | Whether to enable variable format optimization. To improve training efficiency, network variables are converted during initialization into a data format better suited to the Ascend AI Processor (for example, NCHW to NC1HWC0). Disable this function in special scenarios. |
mix_compile_mode | Whether to enable mixed computing. In fully offloaded mode, all computing operators are offloaded to the device side. As a supplement to the fully offloaded mode, mixed computing allows operators that cannot be offloaded to be executed online in the frontend framework, improving the flexibility of the Ascend AI Processor in adapting to TensorFlow. |
hcom_parallel | Whether to enable parallel execution of the AllReduce gradient update and the forward and backward passes. |
graph_memory_max_size | Static memory and maximum dynamic memory of the network, which can be specified based on the network size. The value ranges from 0 to 274877906944 (256 x 1024 x 1024 x 1024) bytes. Due to Ascend AI Processor hardware restrictions, the sum of graph_memory_max_size and variable_memory_max_size cannot exceed 31 GB. Defaults to 26 GB. |
variable_memory_max_size | Variable memory, which can be specified based on the network size. The value ranges from 0 to 274877906944 (256 x 1024 x 1024 x 1024) bytes. Due to Ascend AI Processor hardware restrictions, the sum of graph_memory_max_size and variable_memory_max_size cannot exceed 31 GB. Defaults to 5 GB. |
auto_tune_mode | Whether to enable Auto Tune, which tunes TBE operators during build to find the optimal performance configuration on the Ascend AI Processor, for example auto_tune_mode = "RL,GA". If this parameter is not specified, auto tuning is disabled. For details, see Auto Tune Tool Instructions. |
stream_max_parallel_num | AI CPU and AI Core engine parallelism for parallel execution of AI CPU and AI Core operators, for example "DNN_VM_TF:10,DNN_V100:1". DNN_VM_TF is the AI CPU engine, with 10 concurrent tasks in this example; DNN_V100 is the AI Core engine, with 1 concurrent task. The value range is [1, 13]. Defaults to 1. |
is_tailing_optimization | Whether to enable communication tailing optimization in distributed training scenarios to improve performance. By changing a computation dependency, a computation operation that does not depend on the last AR (gradient aggregation fragment) is scheduled to run in parallel with the last AR. This parameter must be used together with the NPUOptimizer constructor, and its value must be the same as is_tailing_optimization in the NPUOptimizer constructor. |
graph_run_mode | Graph run mode. |
op_debug_level | Operator debug level. |
enable_scope_fusion_passes | Scope fusion pattern (or patterns separated by commas) to take effect during build. Both built-in and custom scope fusion pattern files are classified into two types. |
enable_exception_dump | Whether to dump the inputs and outputs of error operators. |
op_select_implmode | Operator implementation mode. Some operators built into the Ascend AI Processor can be implemented in either high-precision or high-performance mode. |
optypelist_for_implmode | List of operator types that use the mode specified by op_select_implmode; used together with that parameter. For example, set op_select_implmode to high_precision and optypelist_for_implmode to Pooling. |
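The dump_step syntax above (iterations separated by vertical bars, with optional hyphen ranges) can be expanded into concrete iteration numbers with a small helper. This is a hypothetical illustration of the format, not part of npu_bridge:

```python
def parse_dump_step(spec):
    """Expand a dump_step string such as "0|3-5|10" into a sorted
    list of iteration numbers (hypothetical helper, not npu_bridge API)."""
    steps = set()
    for part in spec.split("|"):
        if "-" in part:
            lo, hi = part.split("-")
            steps.update(range(int(lo), int(hi) + 1))  # range bounds are inclusive
        else:
            steps.add(int(part))
    return sorted(steps)

print(parse_dump_step("0|3-5|10"))  # [0, 3, 4, 5, 10]
```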
Example
Example in sess.run mode:
```python
import tensorflow as tf
from npu_bridge.estimator import npu_ops
from npu_bridge.estimator.npu import npu_scope
from tensorflow.core.protobuf.rewriter_config_pb2 import RewriterConfig

X = tf.random_normal([2,])
Y = tf.random_normal([2,])
with npu_scope.without_npu_compile_scope():
    pred = tf.add(tf.multiply(X, 1.), 0.)
    cost = tf.reduce_sum(tf.abs(pred - Y))

config = tf.ConfigProto()
custom_op = config.graph_options.rewrite_options.custom_optimizers.add()
custom_op.name = "NpuOptimizer"
custom_op.parameter_map["use_off_line"].b = True
custom_op.parameter_map["enable_data_pre_proc"].b = True
custom_op.parameter_map["profiling_mode"].b = True
custom_op.parameter_map["profiling_options"].s = tf.compat.as_bytes("task_trace")
custom_op.parameter_map["precision_mode"].s = tf.compat.as_bytes("allow_mix_precision")
custom_op.parameter_map["fp_point"].s = tf.compat.as_bytes("resnet_v1_50_1/conv1/Conv2D")
custom_op.parameter_map["bp_point"].s = tf.compat.as_bytes("add_1")
custom_op.parameter_map["enable_reduce_precision"].b = False
custom_op.parameter_map["variable_format_optimize"].b = True
custom_op.parameter_map["mix_compile_mode"].b = True
custom_op.parameter_map["enable_dump"].b = True
custom_op.parameter_map["dump_path"].s = tf.compat.as_bytes("/tmp/test")
custom_op.parameter_map["dump_step"].s = tf.compat.as_bytes("0|5|10")
custom_op.parameter_map["dump_mode"].s = tf.compat.as_bytes("all")
custom_op.parameter_map["enable_dump_debug"].b = True
custom_op.parameter_map["dump_debug_mode"].s = tf.compat.as_bytes("all")
custom_op.parameter_map["hcom_parallel"].b = True
custom_op.parameter_map["graph_memory_max_size"].s = tf.compat.as_bytes(str(26 * 1024 * 1024 * 1024))
custom_op.parameter_map["variable_memory_max_size"].s = tf.compat.as_bytes(str(5 * 1024 * 1024 * 1024))
custom_op.parameter_map["iterations_per_loop"].i = 10
custom_op.parameter_map["auto_tune_mode"].s = tf.compat.as_bytes("RL,GA")
custom_op.parameter_map["stream_max_parallel_num"].s = tf.compat.as_bytes("DNN_VM_TF:10,DNN_V100:1")
custom_op.parameter_map["is_tailing_optimization"].b = True
custom_op.parameter_map["graph_run_mode"].i = 1
custom_op.parameter_map["op_debug_level"].i = 0
custom_op.parameter_map["enable_scope_fusion_passes"].s = tf.compat.as_bytes("ScopeLayerNormPass,ScopeClipBoxesPass")
custom_op.parameter_map["enable_exception_dump"].i = 1
custom_op.parameter_map["op_select_implmode"].s = tf.compat.as_bytes("high_precision")
custom_op.parameter_map["optypelist_for_implmode"].s = tf.compat.as_bytes("Pooling")
config.graph_options.rewrite_options.remapping = RewriterConfig.OFF  # Disable remapping.

with tf.Session(config=config) as sess:
    print(sess.run(cost))
```
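The graph_memory_max_size and variable_memory_max_size values in the example above are passed as byte strings, and their sum must stay within the 31 GB hardware limit. A quick sanity check for the arithmetic, as a hypothetical helper rather than anything provided by npu_bridge:

```python
GIB = 1024 ** 3  # one gibibyte in bytes

def memory_config_ok(graph_mem_bytes, variable_mem_bytes, limit_bytes=31 * GIB):
    """Return True if the combined graph and variable memory fits within
    the 31 GB limit imposed by the Ascend AI Processor (hypothetical check)."""
    return graph_mem_bytes + variable_mem_bytes <= limit_bytes

# The defaults, 26 GB graph memory + 5 GB variable memory, sum to exactly 31 GB.
print(memory_config_ok(26 * GIB, 5 * GIB))  # True
print(memory_config_ok(28 * GIB, 5 * GIB))  # False
```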